Hello and welcome, everybody, to Cloud Fitness. In today's video we are going to talk about a very common topic: Delta tables. When we talk about Delta tables in Databricks, or about Delta Lake in general, most of us have a lot of questions: what are the different features of Delta Lake and Delta tables, and how can we use them? I will be creating a series of videos on Delta tables. First we will see what a Delta table is and what its basic features are, and then we will discuss each feature in detail in upcoming videos.

So without further ado, let's jump straight to the question: what exactly is Delta Lake? Let's start with data lakes first. We have worked with ADLS Gen2 a lot; we have created plenty of ADLS Gen2 accounts, and that is a data lake. Then you have something called Delta Lake, and it is important to understand that these two are very different things: a data lake is one thing, Delta Lake is another. A data lake is storage such as ADLS Gen2 in Azure, where you store your Parquet files, CSV files, Avro files, or whatever format you want. Delta Lake, on the other hand, is a storage layer that sits on top of your data lake and keeps the data in Parquet format. It is an optimized storage layer: when I say optimized, the data is stored as compressed (snappy) Parquet files underneath, and on top of that there is a transaction log, the delta log, which is what allows it to behave like a normal database table.

In a traditional database you can simply query a table, but in a plain data lake, if you have stored a hundred Parquet files, doing any merge operation, updating records, handling metadata, or doing version control becomes really difficult. Because Delta Lake is an optimized storage layer on top of the data lake, it is reliable for any kind of work you can do on your traditional tables, and in fact it is often much faster as well.

So Delta Lake is nothing but a storage layer on top of your data lake, in Parquet format, and anything you store inside Delta Lake is stored in the form of a Delta table. In a data lake you store data in any format you want, CSV, Parquet, and so on, but in Delta Lake you store it as Delta tables. In Databricks, remember that from runtime 8.0 onwards all the tables you create are Delta tables by default, and you can create them using Python, Scala, SQL, or R, any of the four available languages.

Now coming to the next part: Delta table features. There are multiple features associated with Delta Lake and Delta tables, and it is because of these features that it has become such a widely used technology.
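Before we go through those features, here is a minimal Scala sketch to make the earlier point concrete: a Delta table is physically just snappy Parquet data files plus a _delta_log folder. The path and the toy DataFrame below are placeholders I have added for illustration, not commands from the video.

```scala
// Minimal sketch: a Delta table is Parquet data files plus a _delta_log directory.
// The /tmp path and the toy DataFrame are illustrative placeholders.
val demo = spark.range(0, 5).toDF("id")

demo.write
  .format("delta")
  .mode("overwrite")
  .save("/tmp/delta_intro")

// Listing the location shows part-*.snappy.parquet files alongside the _delta_log folder
dbutils.fs.ls("/tmp/delta_intro").foreach(f => println(f.path))
```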
Delta tables are ACID compliant, just like your normal tables, even though, remember, they are not really database tables underneath; we will discuss each of these points in detail in upcoming videos. They offer scalable metadata handling, so they can manage your metadata very efficiently. You can run both streaming and batch operations on them, and there is schema enforcement, time travel, and upsert support, which we are going to cover in detail, so no worries on that.

Before moving on, I would like to show you a playlist I have created called Azure Basics. In this playlist there is a video on the data warehouse, the data lake, and the lakehouse, and the lakehouse platform is essentially your Delta Lake. I have explained the difference between these three in detail, so that video will help you understand the basics even better. Do watch it, but for now let's move on to the hands-on part.

Let me zoom in so it is clearly visible. The very first command, which I have already covered in an earlier video, does nothing but connect Databricks to the data lake. This is my data lake, the ADLS Gen2 test account I created, and inside the storage account there is a container called test container. Inside that test container there is a CSV file which I am going to read. So first I connect to the data lake. There are different ways to connect ADLS Gen2 to Databricks, and I have already created a video covering those different ways; it should also be in this Databricks hands-on tutorial playlist, so you can go ahead and watch it if you want more information about command 1.

Command 2 is in a Scala notebook, and I am simply reading the CSV file using the Spark read API: spark.read.format with csv, since CSV is the format of my file, and the option header set to true. If I go back and open the CSV file and click the edit option, you will see there are around nine rows and the headers are at the top, which is why I set header to true, and then I pass the location where my file is present. Let me click run. Now I have a DataFrame holding the contents of my CSV file, and I can simply display it; the moment I display the DataFrame you can see the data. So now I have the data in my DataFrame, and I want to create a Delta table out of it, and for that I have command 4.
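Before we look at command 4, here is a rough Scala sketch of what commands 1 and 2 amount to. The storage account and file path in the abfss URI are placeholders, since the exact names on screen are not spelled out in the video.

```scala
// Hedged reconstruction of the CSV read from the data lake; the abfss URI is a placeholder.
val df_datalake = spark.read
  .format("csv")
  .option("header", "true")   // the file has a header row
  .load("abfss://testcontainer@<storage-account>.dfs.core.windows.net/input/sample.csv")

display(df_datalake)          // shows the nine rows from the CSV
```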
In fact, this is one of the most important commands here. df_datalake is the DataFrame where I read the contents of the CSV file from the data lake, and I call .write.format("delta") on it, so I am writing it out in the format of a Delta table. Then .mode("overwrite"); the mode can be append or overwrite, the two basic modes, and then .save() with the location where I want to save my Delta table. So this is my data lake, test container is my container, and I am saying go inside that container and load my table there. A folder from an earlier load is already present, so let me change the name here, say to load_youtube, and run it. When I run it you will see my Delta table gets created.

My Delta table is ready, so let me go back to the storage account, close this, and do a refresh. You will now see the target Delta table folder, load_youtube. This is nothing but your Delta table. If I go inside it, it does not look like a table, but we call it a Delta table because it has all the features of a table. There is a Parquet part file here, and if you look at the file name you can see .snappy.parquet. As I told you, Delta Lake is based on the Parquet format, so anything you create as a Delta table gets created in the form of Parquet files. Snappy is the type of compression, so the file is compressed and stored in Parquet format, and this snappy.parquet file holds your data.

In the same folder there is something called _delta_log. I told you that a Delta table has a transaction log, and this delta log is what makes a Delta table what it is. If I go inside the _delta_log folder you will see one CRC file and one JSON file; this is essentially metadata about your Delta table. If I open the JSON file you will see details such as when the file was created, who created it, what the partitions are, and the size of the file, all the details about your data files. Similarly there is a checkpoint-style file as well, which also holds metadata: how many transactions happened, what the size of the files is, and how many files are present. This delta log is the heart of your Delta table; using it, the Delta table can do everything it does. I will be discussing the different features of Delta tables in upcoming videos, and all those features relate back to this delta log.

Now let me go back. Since I have created a Delta table, let's say I want to read from it. I will copy the location of the Delta table I just created and paste it here. To read from the Delta table I can simply write spark.read.format("delta"), since delta is the read format, and then pass the path, which is this one. So if I want to read my file, I read it like this.
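Roughly, the write from command 4 and the read that follows it look like the sketch below; the target path is again a placeholder standing in for the container location shown in the video.

```scala
// Write the DataFrame out as a Delta table; mode can be "overwrite" or "append"
df_datalake.write
  .format("delta")
  .mode("overwrite")
  .save("abfss://testcontainer@<storage-account>.dfs.core.windows.net/target/load_youtube")

// Read the Delta table back from the same location
val read_delta = spark.read
  .format("delta")
  .load("abfss://testcontainer@<storage-account>.dfs.core.windows.net/target/load_youtube")

display(read_delta)
```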
That is how you read your Delta tables. You will see that read_delta is the DataFrame that gets created, and if I want to see the contents I can display it just like before. So that is how I read my Delta table.

Similarly, let me copy the table location again here. I told you that it creates CRC files and JSON files, that it creates a delta log, and using that delta log it preserves the history of the table. If I run this particular cell you will see columns like version, timestamp, userId, and userName: this is the Delta table history. Table history is another concept we will cover in more detail in an upcoming video dedicated just to this topic. This cell is again written in Scala, using the DeltaTable API: I am telling Databricks to read the Delta table from this particular location and show me its history with .history(). When I say show me the history, it shows all the operations you have performed on that table, and it stores each of those operations as a version. Right now I have run the write only once and made no changes, so the first version is version 0, along with the date and time when I did it, the user id and username, the kind of operation (a write), and, if I scroll up, the cluster id, the notebook id, the number of files written, and the number of output rows, which is nine, as we saw. So it records everything. Likewise, if I make any change and run it again, there will be a version 1 as well: it stores version 0, version 1, and so on. So it also does version control, and it can do that only because of the metadata in the delta log. This is another feature we will discuss in detail. There is one more thing I wanted to show here, but it is a fairly long topic, so I will cover it in another video.

So that is the very basics of Delta tables. Let me do one more thing and try to read a Databricks default dataset. Databricks by default gives you sample datasets at this particular location, /databricks-datasets; we have talked about this in earlier videos. There is a default Delta table present there, provided by Databricks, and I am again showing you how to read it: spark.read.format("delta"), the read format is delta as mentioned above, the path is this one, and then I just display the DataFrame. That is how I can read a default dataset. Note there is a difference here: I am not connecting to the data lake at all; I am just reading from a default dataset that Databricks provides at a particular location. Now that I have read this data into a DataFrame, I want to write it out as a Delta table, so let me run this next command. Here I am writing that DataFrame out as a Delta table.
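Before continuing, here is a rough sketch of the history cell and of the default dataset read described above. The io.delta.tables import, the table location, and the exact dataset path under /databricks-datasets are my assumptions, not values confirmed in the video.

```scala
import io.delta.tables.DeltaTable

// History of the table we wrote earlier: every operation is recorded as a version
val deltaTable = DeltaTable.forPath(spark,
  "abfss://testcontainer@<storage-account>.dfs.core.windows.net/target/load_youtube")
display(deltaTable.history())   // version, timestamp, userId, userName, operation, operationMetrics, ...

// Reading one of the default datasets Databricks ships under /databricks-datasets
// (no data lake connection involved); the exact dataset path is assumed here.
val people = spark.read
  .format("delta")
  .load("/databricks-datasets/learning-spark-v2/people/people-10m.delta")

display(people)
```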
I have written this Delta table at this particular location, and this location is a DBFS path, because this time I am not connected to the data lake; I am just writing to the Databricks default storage, which is DBFS. I gave it an arbitrary location and asked Databricks to create a Delta table there. If I run dbutils.fs.ls, which simply lists the files present at a given location, against the place where I wanted to create my Delta table (let me remove the _delta_log suffix and run it again), you will see the files that are present. If you look above at what I did: people.write.format("delta").partitionBy("gender"), so I have partitioned the table by gender, with mode overwrite and this save path. When I list that path I see two partitions, one for gender female and one for gender male, plus the same _delta_log. If I go inside the delta log (let me copy the path, append _delta_log, and list what is present), again there are CRC and JSON files. We will talk about this in more detail; do not get confused that there are so many JSON and CRC files, because if you run the write multiple times it captures the versions, and to capture those versions it creates additional CRC and JSON files. So this is the delta log that gets created, and the data itself sits in those partition folders. If I go ahead and list what is inside one of the partitions, you will see the Parquet files, snappy.parquet again, and these hold your actual data.

Now remember one thing. Here I specified a path: I am telling Databricks to create a Delta table under its default storage, but at this particular path. If I do not want to specify a path, this is how I do it: people is my DataFrame, and I call .write.format("delta").saveAsTable() with just a table name. I am only giving it a name, not a location, and the location will be handled by Databricks. That is what is called a managed table. I have created a video on managed and unmanaged tables; if you look in the playlist there is a video on managed and external tables, so you can go ahead and watch it to make the concept clear. This command creates a managed table, which means the table is created by Databricks and its location is decided by Databricks; more details are in that video. Let me run it. I had actually run this earlier, which is why it says the table already exists, so let me change the name to people100m and then run select * from people100m. You will see that you are able to save the table even without giving it a location, because the location is handled by Databricks, and when you run select * from people100m you can see the data.
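Pulling this part together, here is a hedged sketch of the partitioned write, the directory listing, the managed table, and the query and drop commands discussed here and just below. The DBFS path and table names are illustrative, and people is assumed to be the DataFrame read from the built-in dataset in the previous sketch.

```scala
// Partitioned write to an explicit DBFS path: creates gender=F/ and gender=M/ folders plus _delta_log
people.write
  .format("delta")
  .partitionBy("gender")
  .mode("overwrite")
  .save("/tmp/delta/people_partitioned")

// List the table folder and its transaction log
dbutils.fs.ls("/tmp/delta/people_partitioned").foreach(f => println(f.path))
dbutils.fs.ls("/tmp/delta/people_partitioned/_delta_log").foreach(f => println(f.path))

// Managed table: only a name is given, so Databricks decides where the data lives
people.write.format("delta").saveAsTable("people100m")

// Query the table by name and aggregate; the result can be viewed as a chart in the notebook
val tableName = "people100m"
val peopleDf = spark.table(tableName)
display(peopleDf.groupBy("gender").count().orderBy("gender"))

// Dropping a managed table also removes its underlying data files
spark.sql("DROP TABLE IF EXISTS people10m")
```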
Similarly, if you want to drop a table, you can drop it directly here: let me drop the people10m table we created. So people10m is dropped now. I do not want to drop people100m because I want to show you something with it, so I am not going to drop that one right now, but you can see that I have just run this command, DROP TABLE IF EXISTS, and dropped the people10m table. So you can drop a table as well, and the underlying data actually gets deleted.

Now, if you want to query your table: let me keep the name as people100m, because that is the table we created, and I have assigned the table name to a variable. If I want to display the contents, if I want to query my Delta table, how do I do that? Simple: spark.table() with the table name, in this case people100m, and then select whatever you want, any kind of query, and just run it. You will see that I have created the table people100m and I am able to query it. That is exactly what this is: querying your Delta table. You can go ahead and visualize your data as well. Let me run this and copy the DataFrame. What I am trying to do here is create a DataFrame out of my Spark table; the table name is defined at the top, and it is nothing but people100m, so my people DataFrame is just spark.table() on that name. From this DataFrame I am selecting gender, ordering by gender, then grouping by gender and doing a count. Just like this you can look at the data profile or create charts over the results, so you can either check the raw results or switch to the chart view. That is how you can query your data using Delta tables.

I hope I managed to make my point clear here, but at the same time there are other features of Delta Lake. I have given you an introduction to Delta Lake and Delta tables, but we are also going to discuss ACID transactions on Spark Delta tables: how Delta tables take care of ACID compliance and how they handle metadata changes. Say your metadata is changing: a file arrives, you have created a Delta table out of it, and suddenly there is a change in the schema. Do you want to push those changes or not? That is where schema enforcement comes in, whether you allow the change and in which way, and those kinds of concepts can be handled easily with Databricks Delta tables. Similarly, there is something called time travel. I have shown you that Delta does version control, so let's say you have fired some command on your Delta table and you want to revert it; because version control is enabled, you can go back in time and fetch the data as it was at that particular point. Likewise, upserts and slowly changing dimensions become very easy when we talk about Delta tables.

Thank you so much for staying until the end. Do let me know in the comments section if you have any doubts or questions; I will be happy to answer. Thank you so much.