Transcript for:
Delta Table Management

hello and welcome to everybody on cloud fitness so in this video i will basically continue on my series of delta tables and in this particular video we will talk about time travel and versioning in delta tables we will see how delta table maintains history so let's move ahead and what you see on the screen right now is the transaction log is a snapshot of a transaction log which basically i showed you in the last video and we will be seeing it now now as well so when you look at this transaction log so this is the heart of the delta table right so whenever you create a table let's say my underscore table so this is a delta table that has been created so under that table you have you know parquet files you have your snappy parquet files and inside that you also have underscore delta underscore log which contains the json files these jsons files are also called your transaction files and you have the checkpoint files as well right so this is the structure of your transaction log which holds all information about your about your delta table so if you remember so this is the delta table that we created uh in a target delta table load underscore youtube so this is a delta table that i created in my past video where i explained you what exactly is a delta table and what are the different features of the delta table right that time i showed you this delta underscore log where you have the crc and the json file right so now in this particular video we are basically going to see in detail about the um maintenance of history and how does delta table maintain history right so in this video uh let's uh most probably you know i'll be working around the same data set and i'll be showing you step by step how uh you know you know history helps you you know to go back in time even to how to roll back how you can audit the changes how you can just view the data of a previous version or how you can go back at a particular time stamp right now uh if you see uh the very first command is the same command where i'm just linking my databricks notebook to the data lake as i did in the previous video now a tilde table is already created i created in the last video using this command right spark dot uh like basically write dot format delta let me just zoom it out so using this command i basically uh you know created a delta table now let's do one thing let's just simply read from the delta table right now whatever is present in the delta table we just want to read it so we have read what is present in the delta table in form of our data frame right i showed you in detail in my previous video so this is the data which is present in my data lake uh delta lake right now i have you know this data four columns and you can see the book balance right it is like two lakhs or something something now you have this data so remember the snapshot of of this data right now in your mind and then if i want to see the history of the table right so what i am doing here if you see delta table i am mentioning spark that go to my delta table at this particular location i'm specifying the location of my delta table after that go ahead and get the history dot history so this particular command of dot history this is written in scala there are multiple ways you can write it you can write it in sql python right now if you see the history of the table since i created it at uh like one time right so right now it has a version zero it has some particular time stamp when it was created it has user id user name it has operation and it has like lot of stuff right details so from where these details are coming so these details are actually coming from your delta log so if i go here right let me open a json file right which is basically the transaction log so you will see it has commit info right it uh it has the information about the username what kind of operation was done what are the operation parameters right and like which cluster was used to create it what is the notebook id what is the cluster id what is the isolation level so all these details which you see over here right uh let me show you let me take it little up so what information you see here operation right operation parameters notebook id cluster id you can actually see these information isolation level in fact i just showed you here right so those information you can actually see in the transaction log file already right now let me do one thing now i am going to overwrite my delta table right so for that i am basically how i'm creating my delta table in my previous video i showed you that i am creating my delta table by reading this particular file initial g and data dot csv if i open this particular file you will see it has a book balance column and it has some you know values over here now i am going to replace this file and i am going to overwrite my delta table now let me upload a file over here so you will actually see that this is the file which i'm uh you know changing i'm actually overwriting this file right now and if i open this file now you will see the values of my book balance you will see that it is 0 1 2 3 4 right it has changed so now let me do one thing let me recreate my delta table correct so the very first thing that i'll do over here is i will read the file from my delta table so this is the you know i have overwritten the file i have a new file now with with the new data right so i'll read it so let me read it in the data frame right now and this is how after reading it look it looks like like book balance zero one two three four five six right now let me write it as a delta table at the same location same delta table and the mode is overwrite so when the mode is over right what will happen is my previous data will be overwritten correct so now you will see that you know my it has written it has written the data now let me try to read from my delta table now i'm doing spark.read.format delta read from this particular location and right now if i do display of read underscore delta you will see that my delta table looks like this now the previous values are gone why previous values are gone because i have overwritten them right now if i go back to the delta lake if i go to this delta table youtube now you will see there are two parque files present the first file or the first data file was present when i first created it now the second file is present because just now i have uh you know overwritten the data in my delta table now if i go to the delta log you will again see that there is one zero zero one dot crc and zero zero one dot json right now this file again holds the information of content that was overwritten right now if i go back and again i try to check the history table right again if i check try to check the history table you will see that there are two versions now the first version was zero when i actually created it and the second version when i overwritten it right and it again holds the information of when it was overwritten and all the details related to it right now if i go and now um let's say i want now you can ask me pabna i want to read now how does this help actually so you can ask me like you know i want i don't want to read the latest version i want to lead the pre i just want to read a previous version so the previous version is zero now in that case you can simply fire this query you know this is written in python again you have a scala version for it you have a sql version for it so spark.format delta i am trying to read the delta table again present at this particular location but i have mentioned in my spa in my read api that option version as of 0. so if you see this is the column version in the history table so i am asking it only to read the data where version is zero so if i run over here if i run this and if i try to display this data frame like i have taken all the data in this data frame so let me just run this just let me just display this data frame so now you will actually see that now you you can actually see the book balance value are the values of version zero and now if i if i make this as version one so let me remove version zero and make it one and just rerun it so now my data frame will have the value of version one and again i try to display it you will see the book balance value changes right so in the similar way if you want to read for a particular version you can do that but at the same time remember there is another column value named as timestamp so if you want to read the data for a particular time stamp you can do that so it is one and the same thing either you use version or you use the timestamp right if i want to use the time sam i have commented it out let me just show you over here let me go back so what you can see over here is uh in the similar way instead of writing version as of i can also write timestamp as of right and i can put whatever time stamp i have like i can copy it from here and i can write it over here and the similar way i can just read it so this is the way in which you can read a particular timestamp or a particular version so just wanted to explain you this that you can go back in our version as well and similarly if you want to write a sequel so this is something you know i just for your convenience i kind of wrote so if you read this command 13 right it is one and the same thing but it is written in a sequel the one which i showed you is written in python now in this sequel in the sql command just select star from you know delta dot whatever your you know delta table location is inside your uh you know this carriage symbol and then version as of 4 version as of 0 whatever you know version you want to run it for basically you can run it let me put zero over here and then run it so similarly uh you know you can read the data for a previous version now if you can see version zero and you have this and this is just a sequel command the previous one was a python command to run that now let's say uh you know if uh i'll go back and i'll try to read my delta table again okay so let me try to read my delta table so this is the command to read my delta table i am just trying to read it in a data frame and then i want to display it in a data frame so if you see i have since i have overwritten so if i'm trying to read it where if i'm if my code is trying to read it it will give me the latest version by default right now if you say by mistake we have overwritten that file and we don't want this we want to go back to a previous version right in that case we have something called as restore command so this restore table delta dot so this is written in sql again you have a python version for it right so restore table delta dot uh you know your location and whichever version do you want to restore it so let's say you want to restore it to the zeroth version to the first version right if i do like this and i run this command then automatically you will see that your table will be restored to the previous version similarly you can do it for the time step as well i am using version because version is easier in those terms rather than the timestamp otherwise you can use timestamp as well so now you can see it has restored your data table to the previous version and it has also given you statistics right table size after restored how many number of files were removed how many restored you know what is the size of the file now if you try if you go and try to read your delta table again let me treat my delta table again so this is my delta table i am reading it in a data frame again and then i'm trying to display it in my data frame right so let me just read it now you will see that the book balance is restored to the previous values right so this is the effect of your delta versioning this is called your delta versioning through timestamp or the version numbers you have your history table and you can check the you know version of your table in that history table you can fetch the latest version you can go back to a previous version in your delta table so this is a very very useful and in fact it is one of the used commands when you are working on the data breaks so thank you so much for being till here and do let me know in the comment section if you did not get this or you want me to you know make video on some other topics do let me know thank you so much for being till here