Hello and welcome back to the channel. In today's video we are going to go through a project based on Spark, dbt and a cloud provider. This was decided by the community: you can see how the community voted last week — we had 108 votes, of which 43 went to Spark, dbt and any cloud platform. The runner-up was a data warehousing project with best practices at 38%, then Snowflake at 16%, and an Apache Flink streaming project at 4%. I've decided we're going with the winner of the poll, which is the Spark, dbt and any-cloud-platform project. So let's go through the system architecture.
As decided, we will be adopting the Medallion architecture for this project. The Medallion architecture gives us three layers of transformation: bronze, silver and gold. In the bronze layer, the data coming into the system is left as is, untouched. In the silver layer, a level of transformation has taken place, and some of the data may have been changed, transformed or even removed. The gold layer is where we have the data that is ready to be consumed by the different business intelligence units and users, for analytics purposes, and even by consumers like data scientists who want to build data science and machine learning models.
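To make the layering concrete, here is a minimal sketch of how the three layers will map onto the ADLS Gen2 containers we create later in this video; the container and storage account names are just the ones I use in this walkthrough, so treat them as placeholders.

```python
# Illustrative only: the three Medallion layers as ADLS Gen2 container paths.
# "medallionsa" stands in for whatever storage account name you choose.
LAYERS = {
    "bronze": "abfss://bronze@medallionsa.dfs.core.windows.net/",  # raw data, left as is
    "silver": "abfss://silver@medallionsa.dfs.core.windows.net/",  # cleaned / transformed
    "gold":   "abfss://gold@medallionsa.dfs.core.windows.net/",    # consumption-ready
}

for layer, path in LAYERS.items():
    print(f"{layer:>6} -> {path}")
```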
For this project we will be using Azure Data Factory, Azure Databricks, dbt and ADLS Gen2 services, so the project sits largely on the Azure cloud and we'll try as much as possible to keep it that way. If you're new here, don't forget to like and subscribe to the channel, and turn on the notification bell so you're the first to know when a new video is released. So let's get into the project and get started.
If you do not have an account, I really recommend you go on Google and search for "Azure free account". The free account gives you $200 of Azure credit and access to 55 services for free, usually for 12 months, so if that's something you want to consider, I suggest you go and start for free right now. I'll be using my brother's account; it's a free account he opened a couple of days ago. So now, to get started, I'm going to start by creating a resource group.
We'll start with resource groups: you can search from here, or select from the list of Azure services shown on the UI. I'll click on Resource groups — I have a couple already — and click Create to make a new one. The subscription currently selected is Azure for Students; if you're on the free account you'll see your Azure free subscription here instead. The resource group I'm creating will be called the medallion spark dbt resource group. The region defaults to East US, so I'll pick something closer to me, around UK South; you can choose whichever region is closest to you, and if you hit a resource constraint and can't create one of the services, I really recommend changing the region of the resource group, or of whichever resource you're creating. So I'll use UK South. Next is the tags section; I don't have any tags, so it's straight to review and create. After validation passes, if you want to download the template for automation with something like Terraform you can do that from here, but I don't need it at this moment because it's out of scope for this video, so I'll just click Create. Now you can see the medallion spark dbt resource group is available for us.
The next thing I want to do is create an ADLS Gen2 storage account — the Azure Data Lake storage where we'll keep our data. I'll search for "Gen2", which points at storage accounts, or you can simply search for "storage accounts". Selecting that shows the ones created previously, so I'll create a new one in the medallion spark dbt resource group I created earlier. For the storage account name I'm going with "medallion" — I'm naming things after the Medallion architecture, which, as I said, is simply the bronze, silver and gold layers: bronze is usually raw data; silver is fine-tuned data that can already be consumed by some analysts and data scientists if needed; and gold is where the refined data sits in a presentable format — it could be data marts, it could feed multiple downstream services, but it's very refined data at that point. Since the storage account name can only contain letters and numbers, I'll settle on something like "medallionsa". The region is UK South, performance is Standard, and I'll keep the default geo-redundancy. Then, in the Advanced tab, I enable the hierarchical namespace, which complements the ADLS Gen2 endpoints by enabling file and directory semantics — in plain English, you get files and directories structured hierarchically, just like the folders and files under My Documents on your own machine. I leave the rest as it is, then under Networking I enable public access from all networks, leave data protection as it is, click Next, keep the default encryption, wait for validation to pass, and click Create.
Now the storage account is being created, and this is where we'll keep the multiple stores — the bronze layer, the silver layer and the gold layer — inside the storage account. I'll wait a couple of seconds for the deployment to finish and then switch over. Okay, the deployment is done, so I go to the resource. On the left I have all the blades, with some keys and shortcuts on the right; on the left, under Data storage, I click Containers. There are no containers yet, so I'll create them: first bronze, then silver, and finally gold. These are the three layers we'll be working with: whatever data comes into the system lands in the bronze layer, from there we process it into the silver layer, and then finally into the gold layer.
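If you'd rather script the container creation than click through the portal, a sketch along these lines should work with the azure-storage-file-datalake package; the account name and key below are placeholders for your own storage account's values.

```python
# Sketch: create the bronze/silver/gold containers programmatically.
# Assumes `pip install azure-storage-file-datalake` and your own account name/key.
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "medallionsa"            # placeholder storage account name
ACCOUNT_KEY = "<storage-account-key>"   # placeholder; keep this out of source control

service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)

for layer in ("bronze", "silver", "gold"):
    service.create_file_system(file_system=layer)  # one container per Medallion layer
    print(f"created container: {layer}")
```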
Once this is done, the next thing is to create a Data Factory. I search for "data factory" and click on Data factories. I already have one, but I'll create a new one: I select the medallion spark dbt resource group and call it medallion-adf. On the next tab, the Git configuration stays at its default — we're not going to use it, and we can configure it later — networking stays on a public endpoint, nothing under Advanced, no tags, then review and create. We wait a few seconds for the Data Factory to be created as well. The deployment succeeded, so I click "Go to resource" and we can see what we have in our Data Factory; I'm not going to launch the Data Factory Studio just yet.
There's one other thing I need: a Key Vault, so I search for "key vault". The Key Vault is where I'll store all my keys and secrets, so I don't have to copy them around every time — I can just drop values into secrets and reuse them from Databricks when we get to that section. That way we don't have to re-copy keys all the time or expose them, which is how you'd want it in a production environment anyway. So I select Key Vault; I already have one, but I'll create a new one, selecting the medallion spark dbt resource group, naming the vault something like medallion-spark-kv, with the region set to UK South and the pricing tier left at Standard. Then on to access configuration.
For access configuration I could use RBAC — role-based access control — but I'll leave it on the vault access policy option, where you can select a user. I think it's best to choose this so you can fine-grain the policy for each particular user it's assigned to. I already created a user, but if you want to create a new one you can use this button, and if you like you can start from a template — just select the permissions you're going to attach to that particular user. The principal is the application that will be accessing this particular Key Vault: for instance, if you have Databricks, the Databricks resource provider can be added to this list, and if anything else — maybe the ADLS Gen2 storage — needs to access the vault for some reason, you can add it here too, along with any other service or application. Then it's just review and create; that's how I created this user. If you click on the user and click Edit you can see the properties and permissions it was given: I don't need any cryptographic operations, privileged key operations or privileged secret operations — I only kept the key, secret and certificate permissions plus the rotation policy operation, which are the four things I attached to that particular user. If you click Review and create you can see all of them, but I won't do that right now.
Once that's done, I click Next and check whether I'm missing anything: resource access for virtual machines for deployment — no; Azure Resource Manager for template deployment — no; disk encryption — no, it's fine. Click Next, allow access from all networks and leave it at that, the tags are empty, then review and create, and once it validates we click Create. It takes a few seconds to finish the deployment, and once that's done we can continue.
Now the deployment is done, so let's switch over to our Data Factory, because that's where the major work happens. I select the medallion ADF and click Launch Studio, which opens Azure Data Factory Studio. On the left we have Home, which is of course the landing page; Author, where we'll create our pipelines, datasets and the rest; and Monitor, where we can monitor pipelines, triggers, change data capture and so on. Right now we're going to focus on two things: the Author section and the Manage section. We start with Manage, because what we want is access to our Gen2 lake storage as well as the database, which we'll be creating shortly — we don't have a database right now, and we don't have any data either. So let's sort that out before we continue into Data Factory and get cracking.
Heading back into the portal, we're going to do one last thing here: search for SQL databases. This is Azure SQL, and mind you, it uses the SQL Server dialect, so if your background is Postgres, MySQL, MariaDB or another database scripting language, some of the syntax may need fine-tuning; for now we're using SQL Server's dialect. I'll click through to create a new database, selecting the medallion spark dbt resource group. Then for the database itself — the one I currently have is very expensive, about $396 — so what I'm going to do first is remove the database that's already on the system. I'll just delete it; it takes a few seconds to complete, and once it's done we can continue.
As usual, a database resides on a particular server, so I'll search for SQL servers. I currently have one server sitting here; I could delete it and create a new one, or see whether there's anything I can adjust — but I basically created that server on the fly, so I think I'll just delete it too. Skip the feedback; we're going to create a new one eventually, and there's nothing important on it, so I confirm the delete. I close this, go back to SQL servers and refresh — yes, the SQL server is deleted, along with the database itself. So we can go back into SQL databases and create a new one.
Creating the new SQL database: right now I have an offer of 32 GB of data and 32 GB of backup storage for free, so I'll apply that offer and avoid paying too much. I select the medallion spark dbt resource group, and for the database name I'll go with something like codewithyu-medallion-db, maybe with a dev suffix, since this is semi-production already. We don't have a server right now, so we create a new one, also called codewithyu-medallion-db, with the location set to UK South — where is it, UK South, show yourself... there it is. For authentication I'll use SQL authentication, with codewithyu as the server admin login and a password that I'm creating now and not sharing with you — use your own password, and don't share it with me either, thank you very much. The passwords match, so I click OK, and that creates the server for us on the fly: the username is codewithyu and the password is something I'm not going to tell you.
For the behaviour when the free offer limit is reached, I want it to auto-pause the database until the next month, and that's fine. I click on to Networking: for the connectivity method I'll use a public endpoint, because I need to be able to reach this — not everybody, just me, and of course I'll be deleting this database after the video so nobody else can access it. No firewall rules needed, the connection policy is fine as it is, so I click Next. Under Security I don't need Defender for SQL, no ledger, no server identity, and the rest stays at the defaults. Then the additional settings — this is very important: for the data source I want the database populated with the AdventureWorks data, which is a ready-made sample database. You need to select the sample so AdventureWorks gets added for you; if you don't, you'd have to download AdventureWorks yourself and load the database with it, which I think is a stressful way of going about it — it's best to just use the sample, which gives you the latest AdventureWorks with all the data sitting there once it's done. So I click Sample and it gives us AdventureWorksLT; I click OK. The tags are empty, and at review and create the cost shows as free of charge. This matters: with the configuration I had before, the first 32 GB is free but beyond that you pay — it was showing about $395, and I can't be paying something like that for a single video. If it were something production-ready, fine, but not here. I click Create, the deployment starts, and we wait a few more seconds for it to finish so we can get cracking.
The resource deployment is complete and we can go to the resource. Let's double-check that the data was populated automatically and that we can reach it: if I open the Query editor (preview), it recognises my username, so I just type in my password — and it says it could not log in from this IP address, so I need to allow my client IP. Once that's allowed, I log in again and I'm in. If you expand the tables you can see all of them; if I run SELECT * FROM SalesLT.Address, you can see the address rows sitting there, and the same for CustomerAddress — I'll clear that, run it again, and there's CustomerAddress sitting there nicely. We have Product as well. Just to show you that the data has been populated for us, we don't need to write any scripts or generate fake or synthetic data: this is AdventureWorks, and it has sales information, customer information and product information, so it works for us.
What we want to do now is automate the process, so the data automatically gets into our bronze layer and from there to silver and then gold. How do we do that? We start with ADF: ADF will go into this database on our server, fetch the table names, loop through each of those tables, and dump all of that data into our bronze layer as the first level. Once the data lands in the bronze layer it will sit there as plain files; in the second and third layers — silver and gold — it will be more like Delta Lake files, though still stored as Parquet underneath, and the initial files in the bronze layer will also be written as Parquet. Let me show you what I mean, so it's a little clearer.
Right now we have all of these tables, and if you want to list them all you can query INFORMATION_SCHEMA.TABLES. Running that gives us the tables, with the schema name and table name for each — let me see if I can widen this a little... maybe not — so we have TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME and TABLE_TYPE for every entry. What we want to focus on are the ones in the SalesLT schema; we don't need dbo, which is more or less added automatically, and we don't need the views either — you could keep the views if you wanted, but for now we exclude them. So we filter with WHERE TABLE_SCHEMA = 'SalesLT' AND TABLE_TYPE = 'BASE TABLE'. Running that again gives us something much more fine-tuned: a clean list of table names that we can feed into a dynamic query in Azure Data Factory.
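If you want to sanity-check that query outside the portal's query editor, a small client-side sketch like this works too; the server, database and credentials below are placeholders for the ones created earlier, and it assumes pyodbc plus the Microsoft ODBC Driver 18 are installed.

```python
# Sketch: list the SalesLT base tables that the ADF Lookup will iterate over.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"   # placeholder server name
    "DATABASE=<your-database>;"                    # placeholder database name
    "UID=<admin-login>;PWD=<password>"             # placeholder credentials
)

query = """
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'SalesLT'
  AND TABLE_TYPE = 'BASE TABLE'
"""

for row in conn.cursor().execute(query):
    print(row.TABLE_SCHEMA, row.TABLE_NAME)
```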
Let's switch over to Azure Data Factory and get cracking with that. In Data Factory we need a linked service that connects to our SQL server, and based on that we can create a dataset. The way I got here is by clicking Manage in the left sidebar and then Linked services; click the plus, go to the Azure tab, type "SQL" and you'll see Azure SQL Database. Click Continue: it's an Azure SQL Database linked service — call it whatever you like, I'll leave the default — with the AutoResolve integration runtime. From the Azure subscription I select my subscription, the server name is the medallion DB server, the database is codewithyu-medallion-db-dev, the username is codewithyu, and the password is, once again, something only I have. Once that's done, let's test the connection — and it fails. The error says the client IP for user codewithyu is not allowed, so we need a firewall rule that allows it. To fix that I click Configure on the server, where it says public network access is on selected networks; I scroll down to "Allow Azure services and resources to access this server", tick it, and click Save. Once that's done we go back to our linked service and test again — the connection is successful — and I click Create. Now we have a linked service from our Data Factory to Azure SQL. We also need access to where we'll be writing and dumping data, our Gen2 lake storage, so I create another linked service, select ADLS Gen2, continue, leave the name as it is, select the subscription and the medallion storage account, and for the file path we'll be writing into I'll just put bronze. Test the connection — successful — and create. Good.
Now we can go back into the Author section and create our first pipeline. Inside the pipeline — I'll minimise this for now — we need something that brings the data in, and for that we need a dataset. So let's create a new dataset; this will be the one reading data from our Azure SQL database into Data Factory. I search for "SQL", pick Azure SQL Database and continue. The linked service is the one we just made, and I'll call this dataset something like "tables query" — something to fetch all of the tables. We cannot select just a single table here, because what we want is all of the tables, so we leave everything else as it is and click OK. It looks a bit odd, but this is how we're going to do it, because we want to assign the table names to this dataset dynamically. So we leave the tables query dataset as it is.
Now let's go into the pipeline and bring in a Lookup, which is going to fetch all of this metadata for us. You can either search for "lookup" or go into the General section and drag a Lookup activity onto the canvas. The Lookup has some settings; for now, click the Settings tab. I don't want "First row only" — if you only needed a single record you could tick it, but I need all the records, so I untick it. We have a source dataset, which is where the data will come from, but nothing is selected as a query for our tables dataset yet, so we need a query that fetches the table list. I copied one earlier but seem to have overwritten it, so we can either go back into the database or take our chances and write it here — actually, let me quickly go back into the SQL database, because I can't recall the exact query. In the query editor I write SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'SalesLT' AND TABLE_TYPE = 'BASE TABLE', check that it works — it does — copy it, go back into Data Factory and paste it into the Lookup's query box. If we preview the data now, we get TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME and TABLE_TYPE for every table.
We can use this to get started and load the data. Once we have this list, the next question is where the data goes; the Lookup itself — I'll rename it "Fetch All Tables" — is fine. Now, for each of the records it returns, we want to loop through and dump that table into our bronze layer. So I search for "ForEach": the ForEach activity iterates through each item, and inside it we add the activities to run per item. The first thing we want inside the loop is to copy the data coming in, so under Move and transform I add a Copy data activity — you can add as much logic in here as you like, but for now we keep it simple. We need to configure the source and the sink. First, back on the ForEach settings: the Items property has to point at the Lookup's output, so I add dynamic content and set Items to the "Fetch All Tables" activity's output value — not the pipeline value, the value array itself. Now, for each item coming through as a row of that output, we want to pick it up and dump it into the bronze layer.
We have the items coming in, but we can't really preview anything yet because there's nothing downstream to work with. We could reuse the tables query dataset, but what we actually need is a separate dataset that reads a specific table based on each record from that query. So I create a new dataset — again Azure SQL — and continue; this time it represents a single SQL table rather than the table list we had earlier. I don't pick a table name here either; I click OK and then add two parameters to the dataset: the schema name and the table name. These two are what will carry each record coming out of our table query into this SQL table dataset, so we can use it to dump each table into the bronze layer. Back in the pipeline, I select this SQL table dataset as the source of the Copy activity. It now asks for the schema name and table name, and since we are not hard-coding them we supply them as dynamic content: for each item, the schema name is the item's TABLE_SCHEMA — remember those column names from the database — and the table name is the item's TABLE_NAME. Click OK for both, and now the copy source follows whatever table the loop is currently on.
So the pipeline fetches all the tables, passes them into the ForEach, the ForEach loops through and hands each one to the source, and the source feeds the sink. Now let's configure the sink — everything on the source side looks fine. The sink is where we send the data, and we don't have a destination dataset yet, so let's create one: I select ADLS Gen2 and then Parquet, since Parquet is the format we'll be writing; this will be the "parquet output" dataset. For the linked service I pick the data lake storage linked service and leave the defaults, but we still need to supply the directory and the file name, so I leave those blank for now and parameterise the dataset instead: it takes two parameters, the folder name and the file name. With access to our data lake from here, the file path is then set with dynamic content — the folder name as the directory, and the file name passed in as the file. I leave the compression as Snappy, though you could use gzip or another supported codec if you prefer. That's how the parameters are wired up, and now we can use this parquet output dataset as the sink.
Selecting the parquet output as the sink, it asks for the folder name and the file name. To configure those, we need to understand what the final output should look like. Based on the records we're copying — take Customer, for instance: if I run SELECT * FROM SalesLT.Customer you can see its rows — what we want is to group everything from one run as a single entity in the bronze layer: the date we bring the data in is the folder the data sits in, and the file name identifies each table separately. So for the folder name I use a dynamic expression that formats the current UTC date and time as year, month and day — you could use day-month-year if you like, it doesn't really matter, but pick one format for consistency, and add separators if that's what you prefer. That becomes my folder name. For the file name, we concatenate the schema name and the table name to build the output name: a concat of the item's TABLE_SCHEMA, a dot, the item's TABLE_NAME, and finally the ".parquet" extension, giving us schema_name.table_name.parquet. That way, if the same table name exists in multiple schemas, the schema name keeps each one as a separate entity.
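The pipeline itself builds these values with ADF's expression functions (formatDateTime over utcnow, and concat over the ForEach item); purely to make the naming scheme concrete, here is the equivalent logic written out in Python — the date format and the SalesLT/Customer values are just examples.

```python
# Equivalent of the sink's dynamic content, for illustration only.
from datetime import datetime, timezone

def bronze_target(table_schema: str, table_name: str) -> tuple[str, str]:
    """Return (folder_name, file_name) the way the Copy activity builds them."""
    folder_name = datetime.now(timezone.utc).strftime("%Y-%m-%d")   # e.g. 2023-12-17
    file_name = f"{table_schema}.{table_name}.parquet"              # e.g. SalesLT.Customer.parquet
    return folder_name, file_name

print(bronze_target("SalesLT", "Customer"))
# one dated folder per run, one parquet file per table inside it
```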
I'll select that, leave the copy behaviour as it is, and keep every other setting at its default. Now we can test this and see whether everything is fine or something needs adjusting before we continue. Let's validate — and it says a table is required for the copy activity. Let's see why: we have the source passed in, but in our SQL table dataset we only have the schema and the parameters sitting there, so we need to feed the table in. Instead of leaving the table selector empty, I choose "Enter manually" on the dataset and pass the parameters in as the schema name and the table name, as dynamic content. With that in place, we go back to the copy activity and validate again — excellent, everything passes — so let's debug and see whether there are any errors before continuing.
Running the debug, you can see multiple Copy data runs, and once the Lookup output comes through you can see TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME and TABLE_TYPE for each table — so we didn't have to hard-code any table names, and the copy is replicated for each of them. Beautifully, each Copy data activity completes successfully, and you can see, for example, 35 KB of data written and the row count for each table — one of them has 847 records. Let's double-check that in the storage account to confirm everything worked as expected: I duplicate the tab, go to the portal home, open the medallion storage account, select the container, and look inside bronze — and there's no data in there, so something is wrong. Even though the copy activities succeeded, nothing was written where we expected. Let's see why.
writing it to a file so let's see why so we have our copy data in here so if I select this and we
have the activity so it's fetching all of this output data and if you go into the copy data the
sync we have the table schema and the table name okay let's see why this is not working as expected
so if I refresh this I'm supposed to have data in this container ah but it's actually created but
it's inside a wrong Direct Tre if you look at what we have right now we have this here and you
can see all of them sitting pretty nicely here so let's restructure the uh where the data is sitting
in and then we can fix that so we created a new storage account for that but we don't I mean he
created a new folder for that but that's not what we want we want to put it inside the the brones
so let's go back in here and in our pile output so the file system we need to put that as uh as
our file uh our container name which is going to be the brones the current one that we have
right now that's what is missing I don't know how I missed this but it's interesting okay so
now going back in here we just have to R run this I just uh run this I need to delete this one I'll
just delete this and then delete it okay confirm it takes a few seconds for the pipeline running
to run to be done and we can continue once it is done now all the activities are done and you can
see if we go into the bronze layer right now we can see the folder of the date the current date
and all of the tables sitting there pretty nicely excellent so now at this point we've been able to
So at this point we've managed to dump the data from our SQL Server into the bronze layer. The next thing is to get this into a Databricks notebook, which involves setting up the notebook, connecting it to our ADF, and populating the schema and database in the notebook using the dynamic content we have in ADF. Let's quickly do that. Back in the portal UI, I duplicate the tab and search for "Azure Databricks". I already have a workspace, but I'm going to create a new one: I select the medallion spark resource group, name the workspace medallion-spark-databricks, keep UK South, select the Standard pricing tier, and click through networking, encryption, compliance and tags with the defaults. Then review and create; it takes a few seconds to finish.
The resource deployment is complete, so we click through to the resource. We're now in the Databricks environment; I just need to launch the workspace, which automatically logs me into the Databricks UI — the Databricks studio, if you like. The first thing I want to do is set up a place for our notebooks. I'll create a folder right here in the root directory — you could put it under a particular user or in the Shared folder, but I'll keep it in the root for simplicity — and call it "Medallion Workspace". Any notebook we create is going to sit in there, so I right-click, create a notebook, and rename it to "base notebook".
In this base notebook we'll be writing some scripts and queries, but before that we need to set up something called a secret scope — not so many people know how to do this. To create one, you append a hash fragment to the workspace URL: take your workspace URL and add "#secrets/createScope" to the end. This is where we connect our Databricks workspace to our Key Vault: we need the vault's DNS name and resource ID in this form, so Databricks can automatically read the keys and we can use them to access our medallion storage. To do that, we just need to establish the relationship between the Key Vault and Databricks.
We already have a Key Vault, the medallion spark key vault, so let's put some data in it and see how that works. I go into Secrets and generate a new one — the upload type is manual anyway — and a secret is just a name plus a secret value. The value we're going to store is the key used to access our storage account. So I go into the medallion storage account, open Access keys, click Show and copy the key — don't worry, I'll rotate this key once the video is done, but basically you grab the access key here and use it as the secret value. For the name I'll call it "storage account key". The content type is optional and I leave the rest as it is; there's no expiration date, though you can set one if you want. Now we have a storage account key secret we can reuse. Just in case you run into problems — for example you can't list the secrets or can't see the storage account key that was created — all you have to do is go back to the vault's access configuration, switch on the access policy model, go to Access policies, and add your own user to the access policy there. Do that and you should be able to see all the records; as it is, we leave it at that, with the storage account key created.
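As an aside, the same secret can be written without the portal. This is only a sketch: it assumes the azure-identity and azure-keyvault-secrets packages, that your identity has permission to set secrets on the vault, and the vault and secret names are just the ones I happen to be using in this video.

```python
# Sketch: store the storage account key in Key Vault programmatically.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://<your-key-vault-name>.vault.azure.net"  # placeholder vault URI

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())

# The value would be the key copied from the storage account's "Access keys" blade.
client.set_secret("storage-account-key", "<storage-account-access-key>")

# Read it back to confirm it was stored.
print(client.get_secret("storage-account-key").name)
```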
Now that I have my storage account key in the Key Vault, we need to feed this Key Vault's information into Databricks. How do we do that? In the Key Vault's Properties I copy the Vault URI and paste it into the Databricks create-scope form as the DNS name, then I go back, copy the Resource ID, and paste that in as well. Then I give the scope a name — it doesn't matter much what you call it, as long as you remember it — so I'll call it "databricks-scope". When I click Create it complains that the Premium tier is disabled in this workspace and a secret scope can only be created with the initial Manage Principal set a certain way — all that big grammar just means you change the Manage Principal to "All Users" — and then you can create it. The secret scope has been created, and I click OK.
Once this is done, we can go back into the base notebook to confirm it was initialised correctly. In the workspace, in the notebook, we use dbutils.fs.mount — we need to mount our storage containers for this to work. For the source I'll just copy a template so I don't make mistakes, change the container to bronze and the storage account name to the medallion account, and for the key I reference my secret scope, "databricks-scope", with the key name being the secret inside our medallion Key Vault, which is the storage account key. We can confirm whether the mount succeeded — except we haven't created a compute yet, so let's do that. To create a compute, go to Compute and create a cluster: I'm going with a single node, Databricks Runtime 14.0, and Standard_DS3_v2 as the node type, leaving the rest as it is, then Create compute. It takes a little while for the compute to be created, and we can continue once it's done.
While that's starting, let's talk through what we have in the base notebook — there are two mount cells in here right now. What I'm doing, basically, is mounting whatever we have in the bronze container of our medallion storage account onto the mount point /mnt/bronze. Once we can connect to the storage account using the key stored in the Key Vault, read through the secret scope, we can access everything under /mnt/bronze and see the contents of the medallion storage account from DBFS. The compute is still creating, so we wait a few more seconds.
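For reference, a mount cell along these lines is what I'm describing; this is one common pattern for mounting an ADLS Gen2 container with an account key pulled from a Key Vault-backed secret scope, and the storage account, scope and secret names are the ones set up earlier in this video — adjust them to your own.

```python
# Sketch: mount the bronze container onto DBFS using the secret we stored earlier.
storage_account = "medallionsa"          # placeholder storage account name
container = "bronze"

dbutils.fs.mount(
    source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
    mount_point=f"/mnt/{container}",
    extra_configs={
        # Account-key auth; the key itself never appears in the notebook.
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net":
            dbutils.secrets.get(scope="databricks-scope", key="storage-account-key")
    },
)

# Repeat with container = "silver" and "gold" to mount the other two layers.
display(dbutils.fs.ls(f"/mnt/{container}"))
```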
The compute creation is done, so we go back into the notebook and check. If I run the mount cell, any error would show up here, but there's none — it returns True, which means it connected to the storage account successfully. If I want to see what's in there, I just run dbutils.fs.ls against the bronze mount, /mnt/bronze, and you can see the 2023-12-17 directory. If I go into that directory — I'll copy the path, paste it and run again — you can see all the files: Address, CustomerAddress, Product, ProductCategory and the rest. That means our connection to the storage account is working as expected. Perfect.
Now we need to mount silver and gold as well, so I copy the cell a couple of times, change bronze to silver in one and to gold in the other, and run it again — but because bronze is already mounted I comment that one out first, so we get silver and gold mounted too. So we have bronze, and if you list the silver mount — nothing in silver yet; ah, I mistyped "silver" — you probably saw that — so I comment the others out, remount silver properly and run it again; now it's fine. The gold mount has nothing in it either. Seeing an empty array is fine: it means the connection is successful and we're good to go. I'll tidy this up and keep the mount code in a script cell once we're done, so if you want the same code you can just copy it and use it in your own project.
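One quick way to double-check what ended up mounted (and to avoid the "already mounted" error when re-running these cells) is to inspect the existing mounts first; this is just a convenience sketch.

```python
# List everything currently mounted on DBFS and note which layer mounts already exist.
existing = {m.mountPoint for m in dbutils.fs.mounts()}

for container in ("bronze", "silver", "gold"):
    if f"/mnt/{container}" in existing:
        print(f"/mnt/{container} is already mounted, skipping")
    else:
        print(f"/mnt/{container} still needs to be mounted")
```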
Once this is done, let's test one more thing before we continue, because this is where everything comes into play: I want to double-check that we can create a database. I run spark.sql with CREATE DATABASE, calling it "testing"; if we can create this database, it means we're ready to run our connections and scripts. Checking the Catalog, it succeeded and you can see the testing database, which means the catalog is fine — by the time we run our real queries we should see all of them there. So I just drop it again and we're good.
Perfect. At this point the connection between Databricks and the storage account has been established, and we've been able to run queries that create a new database in the Databricks catalog. What we want now is this: once data is dumped into the bronze layer, we use those files to create tables — and the database — on Databricks; not in the storage account as such, but in the Databricks catalog. So if we have a Customer parquet file in the bronze layer, we want to use it to create a Customer table in our database, if that makes sense. For that we need a script that automatically recognises the database: if it hasn't been created before, create it; if it has, skip creating it — and the same for the tables — so that running it multiple times only creates things once. The simple way is CREATE DATABASE IF NOT EXISTS for the database and, if you want, CREATE TABLE IF NOT EXISTS for the tables; that's basically the script that's going to do this for us. Let's quickly get into it.
I'll clear out those test cells and we'll work in the base notebook, but before that let's establish a connection between Databricks and our Data Factory — I almost said data frame; I mean Data Factory. In the pipeline I click the plus and add a Notebook activity, which needs to be configured; I'll minimise the other panels. In the notebook activity's Azure Databricks tab there's no linked service for Databricks yet, so let's create one. I leave the default name — change it if you like — keep the AutoResolve integration runtime, and instead of picking the workspace from the Azure subscription I prefer to enter the connection details manually. For the workspace URL I go to my base notebook and copy the address from the start up to the azuredatabricks.net part — I don't need the storage account keys for now — then back in Data Factory I paste that in as the workspace URL. I'll use an access token to authenticate from Data Factory to Databricks: in Databricks that's under user settings, in the Developer section, Access tokens — generate a new token, which I'll call "medallion-adf-databricks", with maybe a 30-day lifetime; it will be deleted by the time this video is released anyway, so that's fine. I copy the token, go back to Data Factory and paste it in.
Once that's done it automatically starts scanning what's available, but this next part is important, because I made this mistake earlier: if you select "New job cluster" you have to be sure you have enough compute quota — maybe you have a pay-as-you-go account and you actually want a fresh cluster spun up each time the job runs. I don't think that's the best approach here, although in some cases, when you have very heavy processing, you might want to spin up a dedicated cluster per job. Let me explain why I'm choosing otherwise: if, like me, you have a limited constraint on the amount of compute you can use, it's best to select the existing cluster already running in your workspace and run the job on that. In a production environment you might prefer a new job cluster — it spins down automatically when the job finishes, but it costs you a cluster start-up for every big workload. So I select "Existing interactive cluster", and at this point the dropdown should populate with what we have — let's see... the existing pool and cluster list should load automatically, and I don't know why it isn't.
So what I'm going to do is discard the changes and set it up again: I open the Notebook activity's linked service again, choose New, keep AutoResolve, enter the details manually, paste in the token, and for the Databricks workspace URL I copy it straight from the Databricks tab and paste it in — and now it loads. It offers the new job cluster, but that's not what we want; we want the existing interactive cluster that's currently running, which is the cluster we created earlier. This is a bit tricky — probably something that could be improved — because the first time around it didn't load the running cluster at all; I just had to close the pop-up and reopen it, and then it loaded. So if you don't see your cluster at first, close the pop-up and create the linked service again. Once you select the existing cluster and test it, the connection is successful, and I click Create. To quickly talk through what this means: once this is set up, any job we submit from ADF will be sent to this cluster using the Databricks submit API.
that's what is going to be happening there so but what we want to do basically is uh we want to have
a notebook path that is going to be triggered once to submit a job to that so I'll be selecting
this and in our root folder I have a medallion work workspace and in my Medallion workspace I
have a base notebook this is where we this is The Notebook that will be triggered once we submit
a job to this particular notebook cluster Okay so so we need to pass in some parameters today so
we know uh the parameters that are going to be accessed uh by this notebook so I'll be adding
three in here so the first one is going to be the schema name which is the table schema and
uh I had the file name uh I'll call this the file name as well as okay I need to add the table
name and I need to have the file name okay file name now for the dynamic content since this is a
parameter so I just need to we are still inside it for each Loop don't forget so we can just pass
in the table schema here okay and for the table name I passing the table name and then the file
name I just need to get the file name and this is going to be automatic Al populated using the
format uh query that we had earlier which is the UTC now you can use any other name that you want
if you if that's what you prefer are using the year year month and day okay once that is done
I just click on okay and we should be able to trigger whatever it is in this uh Bas notebook so
let's populate what will be triggered by the time we submit a job to this particular notebook now
Now, this is a script that I got online and I'll just fine-tune it so we can work with it. I'm getting the file name from dbutils.widgets — I read it from the file name widget, which is the one we just defined — and I do the same for the table name and the table schema. I'll copy this and replicate it for the table schema and the table name; I'll stick with "table schema" and "table name" as the variable names (you can use underscores if you like, it doesn't really matter, I just want something slightly different). Next we need to create the database if it doesn't exist. To do that we call spark.sql, just like we did earlier, with a CREATE DATABASE IF NOT EXISTS statement, and the name of that database is going to be the table schema. Once the database is created, we want to create the table if it doesn't already exist in that database. Again we use spark.sql and put our query in here: CREATE TABLE IF NOT EXISTS, and the identifier we build is the table schema, then a dot, then the table name — so if the schema were "medallion" we would end up with something like medallion.table_name. The table is going to be built from the Parquet files we dumped into our bronze directory, so it's USING PARQUET, and the location of those Parquet files is inside the mount point (/mnt) — don't forget we already have access to the bronze layer mount — so I pass in bronze, then the file path, which is the file name, separated by forward slashes. In fact, let me just copy this part so I don't make a mistake. So what this does is point at the bronze mount with the table schema, the table name and the file name — the year/month/day folder we formatted earlier — and it uses the Parquet dumped there to create the table. That's what it's doing, basically, and that's all we need to do.
All we have to do at this point is trigger the pipeline to see if everything is working as expected. I'll save this, go back into my Data Factory and publish it again. If everything is in place we have the base notebook with the parameters going into it, so let's run it; if there's any error we'll fix it, but for now everything looks fine. Okay, it's loading the data right now and dumping it into our bronze layer; it hasn't reached the notebook stage yet. Now it's submitting to our notebook. If you want to see how the jobs are being submitted, just go into Databricks, and in the Data Engineering section under job runs you'll see the submitted jobs. Looking at one of them, it says the function has no attribute 'sql' — right, it shouldn't be just sql, it should be spark.sql; you probably saw that, why didn't you tell me? Let's fix that: there's a line in here that should be spark.sql, I don't know how I missed it. I'll rerun, and it looks like there's another error: it says the name table_schema is not defined — I'm no longer using the lower-case variable name, so I just need to change these references to the table schema and table name variables I actually defined. Anything else, please tell me now before I run this again. Okay, I've run it again: the notebook is getting submitted, and we shouldn't have any error this time around. You can see all of them are successful; it takes a few seconds to reflect here, but all of them succeeded regardless.
So if you go into the catalog, you will see that we now have SalesLT as our database name, with address, customer address and all the other tables in there. If you check the customer table you can see the sample data populated from the Parquet that was dumped into our bronze layer, and the same for customer address, the address itself, the product and so on — everything is fine, everything is good. What we need to do now is proceed to the DBT section. At this point we have completed the ADF and Azure Databricks integration: we've been able to move data from our SQL database using ADF and land it in Databricks. What we want to do now is set up DBT for the transformations. As I explained earlier, DBT is the transformation layer; it's not really concerned with how data gets into the system or how the data is managed — the only thing it's concerned with is the manipulation of the data, so whatever data you have in your database can be transformed, manipulated and saved into other dimensions and other forms that make sense. So let's switch over to the DBT section and continue from there.
I'll be creating a new project, and I'll be using PyCharm — for some reason PyCharm is now my favourite IDE for Python — but you can use any IDE you like, Visual Studio Code or whatever you prefer. We'll stick with the Medallion naming, so I'll call the project something like medallion-spark-azure-dbt; actually, maybe there's no need for all of that, so I'll just leave it at that. I'm using Python 3.9, so if you want to follow along exactly you should use Python 3.9 as your Python environment. You can change it to something newer, maybe the latest 3.11, but to prevent any issues and errors that may come from dependency updates I'll stick to 3.9; with time all of those dependencies will be fully compatible and you'll be able to use the latest Python environment. We just wait a few seconds for the project to be instantiated, and once that is done we can get started.
Okay, so the first thing I need to do is create a new Python file — I'll just touch main.py — and let me see if I can increase the font size of the IDE to 20; that should be good for clarity. Then I need to pip install dbt-databricks; this brings in dbt and the Databricks adapter, all coined together and wired up into my system, so that's the first thing I need. Once this is done I need one more thing before continuing: I need to make sure my machine can connect to Databricks, and for that I use the package released by the Databricks team, so I'll pip install databricks-cli as well; this gives me access to Databricks from my local system. I'm just waiting for the indexing to finish, because everything slows down while it's still in progress. Now that the indexing is done, the next thing is to configure Databricks on my local system, so that my machine can effectively connect to the Databricks workspace I have in the cloud. To do that I run databricks configure and pass in the token flag. It asks for the host, which is my Databricks workspace URL — this one here, without the trailing forward slash — and the token is the one I created and pasted earlier, so I use that, paste it in and press enter. Now let's test whether I actually have access to Databricks: I run the databricks secrets list-scopes command, and if I can see the scope I created earlier on Databricks, that means I've established a connection. The next check is whether I can list the data I have, so I run databricks fs ls, and if I can see the output, my system is able to connect to Databricks. If you have any problem here, fix whatever error you have before you continue, because this is a critical point for DBT to work on your machine and to connect to Azure Databricks.
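To have those terminal steps in one place, this is roughly the sequence (the prompts will ask for your own workspace URL and personal access token):

    pip install dbt-databricks
    pip install databricks-cli
    databricks configure --token        # prompts for the workspace host and the access token
    databricks secrets list-scopes      # quick check that the CLI can reach the workspace
    databricks fs ls                    # another check: lists the DBFS root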
Now the next thing is to run dbt init. Once I press enter it asks for the name of my project and the related information. For the project name I enter Medallion DBT Spark, but it says that's not a valid project name — only letters, digits and underscores — so I use medallion_dbt_spark instead. Next, which database would you like to use? I'll be using databricks, so I select that option. Then it asks for my Databricks host; let's see what I have as the host — it's the azuredatabricks.net URL, and if there's anything wrong with it we'll fix it when the connection fails. Then the HTTP path: to get this, go into your compute, select your cluster, and under the advanced options you'll see the JDBC/ODBC tab. The server hostname shown there is actually the same host I entered, just without the https prefix — I'll fix that shortly — and the HTTP path is the value shown there, so I copy it and paste it in. For the desired access-token option I choose to use an access token, and the token is the one I copied earlier, so I paste that in. I don't need Unity Catalog, and the default schema I'll be using for this project is sales_lt. One thread is fine, I don't need more than that. The final thing I need to do is bring up my profiles.yml and fix the host I added here, because if I run dbt debug now there's an error that is going to be generated — it says the host is not valid — so let's fix that.
To fix that I need to go into my profiles.yml — it usually lives in the .dbt folder in your home directory — so I'll just drag it into PyCharm. You can see it already contains profiles from earlier projects, like the dbt BigQuery crash-course demo, plus the Medallion one we just generated. It's complaining about the catalog, and the host information is wrong, so I'll remove that, and I'll move the Medallion profile up to the top because that's the one I want to work with. I save it and run dbt debug again, and it says there are still errors with the file, so I'm just going to comment out the older profiles — they're causing parsing issues for me — and in fact I'll remove them entirely for now so I can establish the connection, then paste them back once this is done. Now the connection test passes, but it says the medallion_dbt_spark project is not found — that's because I'm in the wrong directory, so I just need to move into the medallion_dbt_spark directory and run dbt debug again, and all things being equal we should be good. Perfect, everything is fine and I'm able to connect to Databricks with this profile. I'll paste the other profiles back in and just leave them commented out.
Good. Now, going back into my medallion_dbt_spark project, I have my dbt_project.yml, and when I run dbt debug I'm able to connect successfully; you can see the catalog has been resolved as hive_metastore. At this point we need to write some SQL for the transformations, and I'm going to start with the snapshots so we capture the current state of the data we have right now. If you look at our SalesLT schema we only have the base tables, no snapshots at the moment, so we need to keep a snapshot of the current data before any transformation; that way we can replay any of it at any point in time. So I'll go back into my project and, under the snapshots folder, create a new SQL file for each table — each table is handled as its own snapshot. I'll start with address.sql, then customers.sql, and still inside the snapshots directory I'll add customer_address.sql, product.sql, product_category.sql, product_description.sql, product_model.sql, sales_order_detail.sql and sales_order_header.sql. Those are the SQL files we'll be creating, and if you look at the snapshots section you can see all of them. What I'm going to do is copy and paste the scripts I've already written and talk through what each of them does, because it takes some time to write them all out. I'll paste in the address, the customer, the customer address, the product and the product model snapshots. I don't have the product category or product description ones — I removed those — so I'll just focus on the ones I've been able to do and delete the category and description files (you can of course populate all of them yourself if you like). So right now I have address, customer, customer address, product, product model, sales order detail and sales order header.
Okay, let's talk through each of them, starting with the address — I've already pasted it in. What I'm doing is declaring a snapshot called address_snapshot; that's the name of the snapshot. In the config, the file format is Delta — like I said, our silver and gold layers are going to be Delta, but that's not the case for bronze, which is plain Parquet. We have a mount point at /mnt/silver/address, and that's the location where the snapshot data will be written on our storage account (the name just escaped me for a second). Then we have the target schema for the snapshot, invalidate_hard_deletes, the unique key — for address that's AddressID — the strategy, which is check, and the columns to check, which is all of them. The query itself just selects these columns from the SalesLT address source — the same address table we loaded — wraps that in a CTE, and runs a select star over it. That's all it does, and the same pattern applies to the customers snapshot, the customer address, the product, the product model, the sales order header and the sales order detail: they are exactly the same, it's just the column names and the snapshot names that differ.
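As a condensed sketch of what one of these snapshot files can look like — the column and source names follow the AdventureWorks SalesLT schema, and the target schema and /mnt/silver location_root are assumptions based on what was described above:

    {% snapshot address_snapshot %}

    {{
        config(
            target_schema='sales_lt',
            file_format='delta',
            location_root='/mnt/silver',
            unique_key='AddressID',
            strategy='check',
            check_cols='all',
            invalidate_hard_deletes=True
        )
    }}

    with source_data as (
        select * from {{ source('sales_lt', 'address') }}
    )

    select * from source_data

    {% endsnapshot %}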
Once these are done, I think this is a good point for us to commit what we have to our repository on GitHub. If you look at what we have, everything is inside the snapshots folder. It turns out I initialized git in the wrong directory, so I'll remove that .git folder — it should be in the root directory — run git init there and check the status again. I have all of these files, so I'll just add everything; I don't need the main.py anymore, so I remove it, then add everything and commit with the message "added snapshots to DBT".
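In the terminal that boils down to something like:

    git init
    git add .
    git commit -m "added snapshots to DBT"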
Now that that's done, we can run this and see if everything works as expected: just run dbt snapshot, which executes each of the snapshot queries. It fails, saying the snapshot depends on source sales_lt.customer, which was not found. So we need to fix the sources, and then we should be good. I'm actually missing one other important thing here: I only pasted the snapshot code, but I haven't added the sources file, which is where we register all the tables and the schema that live on our Azure Databricks workspace and that we want to model in DBT. To fix that we need a YAML file that declares all of these tables as well as the schema. In the models folder I'll create a new folder — we can call it staging, or bronze if you prefer — and inside it the YAML file itself, which I'll call bronze.yml. In there we have version: 2, since it's a version-2 schema file, and then our sources; let me just copy and paste that in to save time. Basically the source is the SalesLT schema — the AdventureWorks tables loaded into bronze, or whatever description you want to give it — and the tables are address, customer, customer address, product, product category, product description, product model, sales order detail and sales order header.
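A trimmed version of that models/staging/bronze.yml looks something like this — the exact table name casing in your Databricks catalog may differ, so treat the names as assumptions:

    version: 2

    sources:
      - name: sales_lt
        schema: sales_lt
        description: "AdventureWorks SalesLT tables loaded into the bronze layer"
        tables:
          - name: address
          - name: customer
          - name: customeraddress
          - name: product
          - name: productcategory
          - name: productdescription
          - name: productmodel
          - name: salesorderdetail
          - name: salesorderheader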
want to check if everything is fine debuted the BG okay so everything is fine let's check if we
can do DBT snapshot again hopefully we should be good yeah you can see it's unable to pass unable
to do partial passing because saved manifest not found so it's doing a full pass and running
everything from the beginning and you can see each of the snapshots are being run at this point
so we have the address snapshots being run first so all of the seven snapshots are going
to be running and uh we in the three of seven okay perfect it's completed successfully
so if you look at our sales LT right now and you refresh you will see that we have uh we
have our snapshots created and these are the seven tables and if you look at each of
them they have a snapshot with addition of DBT updated DBT xci updated at Value from
and value two right these are very these are important keys for us to know the changes
that are happening and what time each of these record is invalidated so for instance
if you change or remove or deleted whatever record that is here the DBT value to will be
populated and you will see the last record valid date uh before it was changed okay and
that's that's one of the good thing about DBT it keeps track of the history of each of the
data and this is our data is pulling data from P okay so that means uh at this point our snapshot
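For example, once the snapshots exist you could pull only the currently valid version of each customer with a query like this (the snapshot table name is an assumption based on the naming used here):

    select *
    from sales_lt.customer_snapshot
    where dbt_valid_to is null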
So at this point our snapshots are fine, and if you go into the storage account, into the containers, and check the silver section, you can see all of them sitting there nicely: the address snapshot is here, and if you look at the customer you'll see the customer snapshot sitting there as well. Each of these can be used to replay the data — if anything changes, or if for some reason we lose data in our production environment, which here is the gold or silver layer, we can replay what we have in bronze into silver and then into gold, and everything will be back in place. Now, if you have transformations to do, the models folder is where the major transformation work happens. I'm going to delete the example folder that dbt generated, since I don't need it, and in the models folder I'll add a marts folder. The marts are the streamlined versions that go into gold: the snapshots we just built go into the silver layer, and you can do transformations there too — rename columns, merge, whatever you want — but those land in silver; the models that go into gold, which is the final stage of our transformation layer, go into the marts section. Under marts we'll have customer, products and sales, so I'll just mkdir models/marts with those three folders (one of them already exists). Then we create the files: sales.yml, customer.yml and products.yml, as well as the SQL version of each of them, so products.sql, customer.sql and sales.sql. If you look at each pair, the SQL file is where the transformations for the data going into gold will live, while the structure and the tests go into the YAML file — just like the example files we already deleted — so the YAML holds the structure of the table and, optionally, tests for each of the columns going into it, while the SQL holds the transformations that are going to happen.
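To give a feel for what goes into one of those structure files, a minimal sketch of customer.yml might look like this — the column name and tests are illustrative assumptions, not the exact ones used here:

    version: 2

    models:
      - name: dim_customer
        description: "Customer dimension built from the customer, customer address and address snapshots"
        columns:
          - name: CustomerID
            tests:
              - not_null
              - unique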
So let's start with the customer. I already have this written down, so we can quickly talk through it. I'll paste it in, but I'll change the name from customer to dim_customer, because it's a dimension — I'll do a refactor-rename on the model and the file as well. You can name your fact and dimension tables however you like; I'm using the dim prefix here, but use whatever convention works for you. I'll leave the sales model as it is so we can work through each of them. What I'm doing in dim_customer is this: the config says it's materialized as a table — meaning it's written out as a table, not a view — the file format is Delta, and the location is the gold customers mount; that's where it's going to be written. Then come the CTEs I've written: one pulling from the address snapshot, one from the customer snapshot, and one from the customer address snapshot. At the end of the day the transformation that happens is basically a select statement joining the customer snapshot, the customer address snapshot and the address snapshot, and picking out the columns, but you can adjust this and add as many transformations of your own as you like.
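A simplified sketch of what dim_customer.sql boils down to — the join keys, the selected columns and the exact snapshot names are assumptions based on the SalesLT schema, and /mnt/gold/customers mirrors the gold mount described above:

    {{
        config(
            materialized='table',
            file_format='delta',
            location_root='/mnt/gold/customers'
        )
    }}

    with customer as (
        select * from {{ ref('customer_snapshot') }}
        where dbt_valid_to is null     -- only the current version of each record
    ),

    customer_address as (
        select * from {{ ref('customer_address_snapshot') }}
        where dbt_valid_to is null
    ),

    address as (
        select * from {{ ref('address_snapshot') }}
        where dbt_valid_to is null
    )

    select
        c.CustomerID,
        c.FirstName,
        c.LastName,
        c.EmailAddress,
        a.City,
        a.StateProvince,
        a.CountryRegion
    from customer c
    join customer_address ca on c.CustomerID = ca.CustomerID
    join address a on ca.AddressID = a.AddressID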
I'll paste in the structure and the tests I have for the customer, and for the product I'll copy in the SQL and paste the structure as well; likewise I'll paste the SQL for sales and the sales structure. To quickly talk through these: the product model is also materialized as a table, with the Delta file format and its location pointing at the product folder in gold. It selects from the product snapshot we created earlier, and the product model snapshot is selected from there as well. If you notice something here, we are looking at dbt_valid_to: while a record is still current it is null, because nothing has changed yet. But if any of the columns of a record have been modified — updated, removed, deleted, whatever it is — dbt_valid_to gets set and a new row is added, so we select only the latest records, the ones without a dbt_valid_to value, since that column is the timestamp of the point at which the record was changed. That's important. At the end of the day we have our transformed CTE and we select from it, and the same goes for the structure file, which describes dim_product. The sales model works the same way: we select the sales order snapshots, join in the product and the order header, and in the final transformation step select all of the columns as the transformed output. That's basically what is happening there. All we need to do now is run this, and we should see everything sitting in our gold section, ready to be queried. So let's do that and check. In the terminal, a quick pwd shows we're in the medallion_dbt_spark directory, so I just need to do a dbt run — but before that I want to run dbt test to see whether all the tests I've written in these yml files work as expected. It says "configuration paths exist in your dbt_project.yml that do not apply to any resources", referring to the generated example model, so let's remove that. If you look at the dbt_project.yml, we still have the example entry under models, so I'll just remove it. You can list each of your own folders there if you want them to materialize as a table or a view; in case you're wondering how to change or modify that, I have a crash course — check the link in the description or the card on this video — where I show how I converted tables to views and views to tables and how to work with multiple nested folders, but that's beyond the scope of this video, so let's leave it as it is. We remove the example config, run dbt test again, and we should be good.
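Concretely, the generated models section of dbt_project.yml looks roughly like this, and the fix is simply deleting the example block (or replacing it with entries for your own folders if you want to force a materialization per folder):

    models:
      medallion_dbt_spark:
        # delete this generated block, or replace it with your own folders, e.g.
        # marts:
        #   +materialized: table
        example:
          +materialized: view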
The test complains because we haven't run the models yet, so let's do a dbt run first so all of these get registered in the system and we can test them. You can see it tried to run the tests but there was no dim_sales in the package yet; now it's building each of the models as a table, and once that is done we can run the tests properly — the tests do run in the background as part of the build, but it's still worth doing that separately. So now we run dbt test, and it says it found three models and seven snapshots and there's nothing to do, check your configs and model specifications, so I think everything is okay. Let's check the UI: if you go to the catalog, refresh, and open the SalesLT section, you'll see we have dim_customer here, and this is the final shape of the customer dimension. The same goes for the product — you can see columns like the sell end date, the discontinued date and so on — and the last one is the sales model, which is still crunching its sample data and then finishes. You can see what it looks like, the records, the column information, the types and all of that. And if you go into the gold section of the storage account — just go one step up into the containers and open the gold container — you have the customer, product and sales folders; inside customer you have dim_customer, and any further transformations can be handled through DBT at the product and sales level here as well. They're all there. Finally, the last thing we want to do is generate the documentation so we can view what the structure of the system looks like. We just run dbt docs generate, which generates the documentation for all the transformations that have been done so far and writes it to catalog.json in the target directory, and then we run dbt docs serve to view it in the browser.
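Those last two commands are simply:

    dbt docs generate    # compiles the docs and writes manifest.json / catalog.json into target/
    dbt docs serve       # serves the generated documentation site locally in the browser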
I'll drag the browser window over so you can see how it looks. So this is what the documentation looks like: we have the sources here, we have the medallion_dbt_spark project, and under it the models — the marts, the customer folder, and dim_customer — along with its code. The description says "not documented" because I didn't add much documentation there, and then you have the columns, the dependencies — which are these three snapshots — and the full code. The same goes for the product, where you'll see the details and the code, and for the sales. You can also see the snapshots, what references what, and how each of them is referenced: for the customer, dim_customer references the customer snapshot and has a node for it, and for the sales you can see its sources. For each of them you can see the compiled SQL that gets selected — this one selects data from our hive_metastore on Azure Databricks — and the same with the address: you can see its snapshot and the query that is used there. If you look at it at the database level, it shows SalesLT and the snapshots, which is exactly what we have in our catalog, with the exception of the product model and product description tables that weren't added. That's how it looks; I don't have any grouped models here, but that's how you can see it at the project level as well as the database level. If you want to see the lineage graphs: the one for product category is simple, but a slightly more interesting one is dim_customer — it has a graph going from the SalesLT sources into the address snapshot, the customer into the customer snapshot, and everything flowing into the customer dimension. The same way, if you view the sales in full screen, we have the sales order detail going into the snapshots, then the product and the order header, everything flowing into the sales model. That's how the graphs look. The product graph isn't particularly interesting — nothing too complex — but that's how it looks anyway. So that's basically it. If you have any questions, comments or feedback, drop them in the comment section. That caps it all; that's where we're stopping for our DBT work. But before you go, if you enjoyed this video and learned something new, please give it a like, subscribe and turn on the notification bell so you don't miss out on the latest content. If you're new here, don't forget to subscribe before leaving, and you can check out the videos listed on the screen for more content like this. Thank you very much for watching, and I'll see you in the next one.