Hello and welcome, everybody, to Cloud Fitness. Today's video is again a continuation of my playlist series on Delta tables, and today we will discuss the vacuum command in Delta tables. We will see what the vacuum command is and how it works, and we will go ahead and do a hands-on with it. In case you have not watched my previous videos on Delta tables, I do recommend watching them for a better understanding.

In our previous videos, you know that we created this Delta table, the target Delta table load_youtube. This was the Delta table that we created; we ran some history commands on it, we saw how it works, and we saw how the Delta log happens. Now we will see how the vacuum command works.

Before moving on to the vacuum command, I will quickly go ahead and run the history command on our table. We have already discussed the history command, so I'm not going to repeat it; before running any commands, I just want to show you the history of the table first. If you remember, we had version 0, where we created the table. Then we had version 1: because we wanted to see the history and go back to a previous version, we re-ran our Delta table load and overwrote the file in the Delta table, and that is when we got version 1. After that we did a restore; I showed you how to do a restore operation, and we restored the table back to version 0. So this is the work that we did in our previous videos.

Now I'll talk about vacuum. When you talk about vacuum on a Delta table, what exactly is vacuum? In fact, let me go back to this page itself. A Delta table refers to some set of files, right? If I do a select star from a Delta table, it is referring to a set of files. What about the set of files that it is not referring to? Now, the vacuum command
is actually used to delete the set of files that your Delta table is not referring to. But now the question is, why do we want to delete them? Let's say there is a process that runs every day, and every day files are coming in. If that is the case, then in a year you will have so many files, and so many CRC and JSON files as well. If you try to read from that Delta table, there can be performance issues, because you are reading from a Delta table that has so many files. In this case we might not want to keep two-year-old or one-year-old files; we just want to have, let's say, one month's worth of data, and we don't need any more than that. In that case we can go ahead and use the vacuum command, based on our requirement of how much data we want to store.

Based on that, you can run the vacuum command, and it will delete your files. Let's say these are the two files, and I don't want to have a file in my data lake that my Delta table is not referring to; in that case I will go ahead and run a vacuum command. So this is the concept of the vacuum command: it deletes the files that are not referenced by your Delta table.

Now, the vacuum command does not trigger automatically. You need to run it manually, or you can schedule it based on your requirement, but it does not occur automatically. Because it is a deletion of data, it should not be done automatically by Databricks; it should happen when you want it to. The default retention threshold for the files is seven days, so for seven days your files will always be retained; that is the default threshold. After that, you
know, let me in fact jump directly to the example, which will actually help you. First, let me mark this setting as true to show you how it works. Okay, so right now this is how the vacuum command looks: VACUUM delta dot, and this is my Delta table, the table that we created, then RETAIN 0 HOURS DRY RUN. I will explain this command, but let me just run it first. Now you see it has given an error.

First, let's go through the syntax. VACUUM, then the Delta table, so vacuum this particular Delta table. Then RETAIN: retain 0, where 0 is the number of hours, so it means retain the data for 0 hours. And then you have something called DRY RUN. Dry run basically means that it will not execute; it will just show you the list of files that would be deleted as part of this command, the one I am highlighting. So this is the syntax: VACUUM, then your Delta table name or your Delta table location (right now I have given delta dot the path of the Delta table), then RETAIN and the number of hours of data that you want to retain, and the rest of the files just get deleted.

So what does the command I have written do? It keeps only 0 hours' worth of files and deletes the rest. If I write DRY RUN, it is not going to actually do it; it will just list the files that can be deleted as part of the command.

Now, if you see, when I ran it, it gave me an error. What does this error show? "Are you sure that you would like to vacuum files with such a low retention period?" The default retention period is seven days. Now we are trying to delete a file that we just created, right? We did
it in our previous video; we just created this file, and it has not been seven days. Are you sure that you want to vacuum these files? The default is seven days, and you are trying to delete the file before that; that is why it is asking you. Then it says that if you really want to delete it, check out the settings: there is a setting, spark.databricks.delta.retentionDurationCheck.enabled equal to false. And if you are not sure, please use a value not less than 168 hours. What is it saying? Databricks is making sure that it gives you an error when you try to delete a file with a retention period of less than seven days, that is, less than 168 hours, and it is also telling you that if you still want to delete it, you should check out this setting.

Now we will go to the setting, because for this demo we want to play with it. This is the setting: SET spark.databricks.delta.retentionDurationCheck.enabled. Now think for yourself about what it is saying: retention duration check enabled is set equal to true. Right now it is true. If I set it to false, f-a-l-s-e, and run it, it will not check the retention duration. I have deliberately asked Databricks not to check the seven-day duration for any vacuum command that I run.

So now if I go ahead and run this command, you will see that it lists the file that will get deleted as part of it. This is the file that will get deleted by RETAIN 0 HOURS DRY RUN; it is going to delete this particular file. Which file is this? Part zero zero... okay, we are not going to copy the whole name, we will just look at the last characters. It is e84, so this is the
file name, e84. It is going to delete this file. Now we'll go back here and see which one is e84. This is the file that will get deleted if I run that vacuum command. Which file was that? If you remember the operations we have done, this was the file that we created in the second run, when we ran the Delta load the second time.

Now we will go back to the history to explain this in more detail. If you see, we did this restore operation. What did the restore operation do? It started referring to version 0, and version 0 was the first file. Right now my Delta table is actually referring to the first file, which was created at 6:05 PM. Since I have asked it not to retain any old files, it will just remove this file, because it is not referencing it; it is only referencing the first file. In my last video I showed you that I did a restore command, and with that restore command my Delta table went back to referring to this file and this version. It does not need version 1, so it goes and deletes the file related to version 1. Version 1 was the second file that was created. So this is how vacuum actually works.

Till now I have shown you the dry run command. Let me do one more thing: let me go back and set the check to true, because I want Databricks to give me an error when I'm trying to delete something with a retention period of less than seven days. Now, instead of doing a dry run, I run the same command but with DRY RUN removed, because now I actually want to delete. Again it is giving me the same error. So what I do is go back, set this setting equal to false, and run it. Now I have set the retention duration check enabled setting equal to false.
Now what will happen is that Databricks will not check for the seven-day period; it will directly delete the file when I write this condition. If I run this now, you will see that it deletes the file. You see, it has started executing this statement, RETAIN 0 HOURS, and it has started deleting my files. You can see even the message says it is updating the snapshot version. Now it has executed the command. After this, if I go back to my storage account and do a refresh, you will see that it has removed that particular file. This was the Delta table that we created; earlier it had two files, and now vacuum has deleted the file that the table is not referring to. This is how vacuum works. The file it has deleted corresponded to version 1; we know that from our previous video.

Let me first see the history of the table. Now you see that version 3 and version 4 have also been added. Check the operations over here: version 0 was the first time we ran the load, version 1 was the update where we had overwritten the file, in version 2 we restored back to version 0, then in version 3 the vacuum started, and in version 4 the vacuum ended. So it has also updated the Delta history table, which we talked about in our previous video.

Now you see it has removed the file of version 1, correct? If I try to refer to the version 1 file, let's see what kind of error it gives. This command, the restore table command, we also discussed in our previous video. So now I want to restore the table to version 1, but vacuum has deleted the file for version 1, and I'm asking it to restore it back to
version 1. Now it should give me an error, because the version 1 file is not available. You see, it has given me an error, and the reason is that, as you can see, the specified path does not exist, because that particular file got deleted by the vacuum command.

In fact, let me go back and set the check back to true. We should always make sure that we do not run the vacuum command unnecessarily. The vacuum command is only used to remove unnecessary files that you do not require, which are taking up a lot of your space and can hamper your performance as well. In this case you have seen that once you have run a vacuum command you cannot go back, because you would have deleted the underlying file. So this is how vacuum works. Do let me know in the comment section if you have any doubts, and thank you so much for staying till here. Thank you so much.
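
As a recap of the behaviour shown in the demo, here is a minimal Python sketch of the selection logic only. It is a toy model and not the Delta Lake or Databricks API: the `DataFile` class, the `vacuum` function, its parameters, and the file names are all hypothetical, and it uses a file's creation time as a simplified stand-in for however the real command judges a file's age. It assumes three things from the video: only files the table no longer references are candidates, the default retention is 168 hours, and a shorter retention is rejected unless the retention check is switched off.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_RETENTION_HOURS = 168  # 7 days, the default threshold mentioned in the video


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file in the table's storage folder."""
    name: str
    created: datetime


def vacuum(all_files, referenced_names, now,
           retain_hours=DEFAULT_RETENTION_HOURS,
           retention_check_enabled=True,
           dry_run=False):
    """Toy model of vacuum: returns (remaining_files, files_selected_for_deletion)."""
    # Safety check: refuse a retention below 7 days unless the check is disabled.
    if retention_check_enabled and retain_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError(
            "retention below 168 hours; disable the retention check to proceed"
        )

    cutoff = now - timedelta(hours=retain_hours)
    # Candidates: not referenced by the current table version AND older than the cutoff.
    to_delete = [
        f for f in all_files
        if f.name not in referenced_names and f.created <= cutoff
    ]

    if dry_run:
        # A dry run only lists the candidates; nothing is removed.
        return list(all_files), to_delete

    remaining = [f for f in all_files if f not in to_delete]
    return remaining, to_delete


# Scenario from the video: two files, table restored to version 0,
# so only the first file is still referenced.
now = datetime(2024, 1, 1, 19, 0)
files = [
    DataFile("part-0000-first.parquet", now - timedelta(hours=1)),      # version 0
    DataFile("part-0000-second.parquet", now - timedelta(minutes=30)),  # version 1
]
referenced = {"part-0000-first.parquet"}

# Dry run with the check disabled: lists the version 1 file, deletes nothing.
kept, listed = vacuum(files, referenced, now, retain_hours=0,
                      retention_check_enabled=False, dry_run=True)
print([f.name for f in listed])

# Real run: the version 1 file is gone, so restoring to version 1 can no longer work.
kept, deleted = vacuum(files, referenced, now, retain_hours=0,
                       retention_check_enabled=False)
print([f.name for f in kept])
```

Calling `vacuum(files, referenced, now, retain_hours=0)` with the check left enabled raises an error in this sketch, mirroring the error seen in the demo, and calling it with the default retention selects nothing because both files are younger than seven days.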