Hello and welcome, everybody, to Cloud Fitness. In today's video we are going to talk about Auto Loader in Databricks: what exactly Auto Loader is, why we use it, and how we can actually implement it in Databricks. When we talk about Auto Loader, remember it is simply a way to do incremental file loads. You have a data lake, and files keep arriving in it — maybe every day, maybe every hour, maybe every minute. How do you process that incremental data? Auto Loader is just one way to do it. So now you already know the basics: it is just a way to process your incremental loads. But how is it different from your traditional loads?
What extra features and benefits does Auto Loader give you? To answer that, think about how we traditionally handled incremental data loads. The first approach is to process the whole load at once: say your incremental files arrive once or twice a day, then you usually fix one particular time — say 5 PM in my time zone — and process all the files that have accumulated.
That is nothing but a batch process: you handle the incremental loads as a batch and simply process them on a schedule. The second approach is to build an ETL or ELT pipeline that stores metadata about which file was processed last from the data lake. When a new file arrives, you look up the last processed file and process only what has come in since — that kind of pipeline also gives you incremental loading. The third approach that came along is Structured Streaming, where you use a streaming query directly to do the incremental loads. Those were the three traditional ways of processing incremental loads, but each has its own issues. With the one-time load, you are processing even your incremental data as a batch, so you always have a delay. And with the approach where you store the last processed file somewhere and do a lookup whenever a new file arrives —
in that case you have the overhead of maintaining that pipeline, and the overhead of reading that whole tracking file, which can grow over time and slow down your process. Similarly, with plain Structured Streaming, the stream ends up listing and tracking everything in your data lake, which can also slow things down if too many files come in every day.
So each of those approaches has its own pros and cons. Auto Loader also does incremental loads, but it takes the best of all three points I just mentioned: it can process your data one batch at a time, it keeps track of the last processed files, and it actually builds on Structured Streaming. Now that you have the background, let's look at what Auto Loader is in a more formal way. Auto Loader provides a Structured Streaming source called cloudFiles, and that is what lets it do incremental processing: as soon as new data files arrive in your data lake, the cloudFiles source picks up and processes just those latest files, incrementally and efficiently. And whenever new files are arriving in your storage account, you also always have the option to process (or reprocess) the files that are already present in that directory.
So all of those options are available. Auto Loader can ingest JSON, CSV, Parquet, Avro, ORC, text, and binary file formats — so even if you have images, you can go ahead and use Auto Loader for them.
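Just so the shape of this is concrete before we go deeper, here is a minimal sketch of what an Auto Loader read and write looks like — the paths are placeholders I made up, and we will walk through the individual options as the video goes on:

```python
# Minimal Auto Loader pattern: the cloudFiles source plus the format of the incoming files.
# All paths below are illustrative placeholders.
df = (
    spark.readStream
        .format("cloudFiles")                                 # this is what turns Auto Loader on
        .option("cloudFiles.format", "json")                  # could also be csv, parquet, avro, orc, text, binaryFile
        .option("cloudFiles.schemaLocation", "/mnt/autoloader/basic/_schema")  # where Auto Loader tracks the schema
        .load("abfss://input@mystorageacct.dfs.core.windows.net/landing/")
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/autoloader/basic/_checkpoint")     # progress tracking for exactly-once
        .start("/mnt/autoloader/basic/output")
)
```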
I also told you that Auto Loader maintains a checkpoint. By checkpoint I mean it records how far it has read — which data it has already processed — in a checkpoint location, stored as a key-value store. Using this, it makes sure your data is processed exactly once. Now, essentially, Auto Loader has to detect your new files as they come in. How does it do that? It has two modes for this: directory listing and file notification.
Directory listing is more of a traditional approach: when a file arrives in the data lake, Auto Loader lists everything present in the directory, figures out which files are new, and processes them. The other mode is file notification. Here, as soon as a file arrives, instead of listing everything in the data lake, Auto Loader makes use of the notification services offered by your cloud provider. In the case of Azure, for example, it uses Azure Event Grid together with Azure Queue Storage for the messaging part: the queue subscribes to the file-arrival notifications, and Auto Loader reads those queue messages to know that a file has arrived and needs to be processed. That is the file notification service. This diagram shows how both of them work: in directory listing mode, the cloud storage directory is listed to find the files; in notification mode, the cloud provider's notification services are used — in Azure, Event Grid publishes the events, the Azure queue subscribes to them, and each queue message tells you which file has arrived so you just have to process it. Also, it is not only file notification and directory listing that you can configure: you can use trigger once as well, the same trigger you use in normal Structured Streaming, if you want files to be processed only once after they have arrived — it will start your stream, trigger the processing once, and then close the stream.
So let's see — I'll show you right now how this file notification works. Let me go over to the Databricks notebook that I have. So what does the notification-services version actually look like?
The very first thing you need to do is connect your storage account to Databricks. I already have a video on that, and there are multiple ways to do it, so go ahead and check it out. In the first statement — the one I have highlighted — I am simply connecting my storage account directly to Databricks.
Now, after that, comes Auto Loader itself. Auto Loader makes use of the cloudFiles source, and this is where you
provide its configuration. The configurations vary with your use case: since we are using file notification here, these notification options have to be provided; if you are not using notification, you simply leave them out. So what are they? The subscription ID; the connection string, because you are actually connecting to the queue, so this carries the SAS token for the queue service; the file format you are reading; and the tenant ID, client ID, client secret, and resource group. All of these details come from the queue service associated with the storage account. Then there is the useNotifications option: setting it to true is what tells Auto Loader, "use the notifications coming from my queue." Similarly, you also have an includeExistingFiles option.
That controls whether you also want to process the files that already exist in the directory, or only the new ones. There are plenty of options like this; I will make a separate video on the full setup, but for now I just want to make the concepts clear and show you how each piece fits. Once you have these configurations set up, you actually start using Auto Loader for the incremental processing: you say spark.readStream.format("cloudFiles"). That format, cloudFiles, is what tells Spark you are using Auto Loader — Auto Loader always works through the cloudFiles format, so you have to mention .format("cloudFiles") in all cases. Then in the options you pass these configurations — here that means the notification settings and the details of the queue you created. You can pass many other options depending on the use case; I won't cover all of them in this video, but for example recursiveFileLookup can be turned on or off if your data is laid out in partitioned folders. Similarly, if you want to pre-define the schema you can do that, or you can skip it and let the schema be inferred. Finally, the load path is your ADLS account location. Once you do this, you start reading the stream, and then you can use writeStream and write it out
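To make that cell concrete, here is a rough sketch of the notification-mode configuration. The option names are the documented cloudFiles options; the subscription, tenant, client, queue, and path values are placeholders you would replace with your own Azure details:

```python
# File-notification mode on Azure: Auto Loader reads file-arrival events from a storage queue
# (fed by Event Grid) instead of listing the directory. All IDs, secrets, and paths are placeholders.
notification_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",           # use the queue notifications instead of directory listing
    "cloudFiles.includeExistingFiles": "true",       # also process files already sitting in the directory
    "cloudFiles.schemaLocation": "/mnt/autoloader/notification/_schema",  # or pass an explicit schema instead
    "cloudFiles.subscriptionId": "<azure-subscription-id>",
    "cloudFiles.tenantId": "<tenant-id>",
    "cloudFiles.clientId": "<service-principal-client-id>",
    "cloudFiles.clientSecret": "<service-principal-secret>",
    "cloudFiles.resourceGroup": "<resource-group>",
    "cloudFiles.connectionString": "<queue-storage-connection-string-or-sas>",
}

df = (
    spark.readStream
        .format("cloudFiles")
        .options(**notification_options)
        .load("abfss://input@mystorageacct.dfs.core.windows.net/landing/")
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/autoloader/notification/_checkpoint")
        .start("/mnt/autoloader/notification/output")
)
```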
to any preferred location. So that is how you use Auto Loader with the notification service. Now let me go back to the slides, because there is more to cover. I told you we can also use trigger once — so how do you use it? Let's say you are reading from some input location where you want Auto Loader enabled: again spark.readStream.format("cloudFiles"), which is mandatory, then .option("cloudFiles.format", ...) for the format of the files you are reading — that also needs to be mentioned — and then the path you want to read from. When you write the stream, you specify the Delta format and you add trigger once. What happens then is that when it finds new files, it triggers your stream once, processes them, and then stops the stream. You will also see a checkpointLocation option, which you need to give: as I told you, Auto Loader keeps track of what it has read as a key-value store, and this checkpoint location is simply where it stores that information. That is how you use trigger once — a quick sketch follows.
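Here is a small sketch of that trigger-once pattern, with placeholder paths. (On newer runtimes, `.trigger(availableNow=True)` is the recommended successor, but the idea is the same.)

```python
# Trigger-once: start the stream, process whatever new files have arrived since the last run, then stop.
df = (
    spark.readStream
        .format("cloudFiles")                                   # mandatory: this is Auto Loader
        .option("cloudFiles.format", "csv")                     # the format of the files being read
        .option("cloudFiles.schemaLocation", "/mnt/autoloader/once/_schema")
        .load("/mnt/landing/daily/")                            # the path to read from
)

(
    df.writeStream
        .format("delta")                                        # write out as Delta
        .trigger(once=True)                                     # run a single batch, then shut the stream down
        .option("checkpointLocation", "/mnt/autoloader/once/_checkpoint")  # Auto Loader's record of what it has read
        .start("/mnt/autoloader/once/output")
)
```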
Now, directory listing — that is the default mode. Whenever you enable cloudFiles and you have not turned on file notification, Auto Loader automatically uses directory listing by default (and you can still combine it with trigger once). This slide also tells you when to use what. If you have a file arriving once or twice a day and you are okay with a bit of delay, you can go ahead and use trigger once. If files arrive very frequently throughout the day, it is usually better to use the file notification service. And if files arrive regularly but you do not have lots of small files arriving too frequently, directory listing works fine as well. Now let me move ahead and tell you a few features of Auto Loader that are essential to understand. The very first is schema evolution. Let's say you are reading a stream and a new column gets added — how does Auto Loader respond to that? There are multiple things you can do: you can fail the stream (there is an option for that, which I will show in this very video); you can have the new column automatically added to your schema; or you can say, "I don't want to fail my stream, but whatever records come in with the new schema, send those extra fields to a particular column."
That mismatching data can simply go into a dedicated column of its own. This is what schema evolution is about, and these are the features Auto Loader brings. Now let me show you a few of these options.
If you want to pre-define your schema, you can do that; and if you would rather have the schema inferred from the files being read, you can do that too — I'll come back to schema inference a little later. For now, let's say you are reading a JSON file from a particular location. Again it is spark.readStream.format("cloudFiles"), and then you set the schema evolution mode: the option cloudFiles.schemaEvolutionMode set to failOnNewColumns. The moment you put that in, you also pass the schema — and this readSchema is the schema I have defined at the top of the notebook.
So readSchema is simply the schema I am handing to Auto Loader: I am saying, when you read these files, use this readSchema, and if anything shows up that is not part of it — a new column, say — fail the stream. Then I also write the stream out: writeStream with my checkpoint location and the target location where I want to load the data. A sketch of the whole pattern follows.
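Here is roughly what that looks like end to end, assuming a readSchema defined at the top of the notebook (the columns here are invented just for illustration, and the paths are placeholders):

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder for the readSchema defined "at the top" of the notebook.
read_schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_time", TimestampType(), True),
])

# failOnNewColumns: if a file arrives with a column that is not in read_schema, the stream fails.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
        .schema(read_schema)
        .load("/mnt/landing/events/")
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/autoloader/evolution/_checkpoint")
        .start("/mnt/autoloader/evolution/output")
)
```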
The moment I do this, as soon as a file arrives with a new column, the stream fails. I ran it earlier, and you can see the error: "encountered unknown fields during parsing" — it hit unknown fields while parsing and failed. I will create a separate video where I show the whole setup and run it end to end, because otherwise this video would get really long; but that is how you fail on new columns. Now let me also show you the rescued data column. Suppose that instead of failing, you want a column — a rescued data column — where you put the data that does not match the schema. How do you do that? The same thing again: spark.readStream.format("cloudFiles") with the JSON format, but in the schema evolution mode, instead of failOnNewColumns you now say rescue, and in the options you give the name of the rescued data column. You essentially do not lose any data: whatever has a mismatch goes into that column. That is the only change compared to the previous example, and that is how you do schema evolution with Auto Loader. A sketch of the rescue variant follows.
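Here is the rescue variant as a sketch — the read is identical except for the evolution mode and the rescued-data column option. The column name is whatever you choose (`_rescued_data` here is just an example), and the write side stays the same as before:

```python
# rescue: the schema is not evolved and the stream does not fail; fields that do not match
# read_schema are captured in the rescued-data column instead.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .option("rescuedDataColumn", "_rescued_data")   # name of the column that collects mismatched data
        .schema(read_schema)
        .load("/mnt/landing/events/")
)
```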
Now, when I talk about schema inference: the first option is pre-defining the schema, just like in normal Spark — you create a StructType and define your columns inside it. That readSchema I was showing you is exactly that: a pre-defined schema that I define and then tell Auto Loader to read with (because I mentioned cloudFiles) before writing the stream out. But let's say you do not want to specify the schema at all and you want the stream to infer it from the files themselves. In that case you provide the option cloudFiles.inferColumnTypes, which tells Auto Loader to infer the column types from the files — without it, the inferred columns are mostly treated as strings. You will also see cloudFiles.schemaLocation mentioned. What is that? Auto Loader stores the schema it has read at that location, whatever path you specify, so even if there is a schema change it keeps a history of the schemas it has seen in one place, and it uses that stored schema while reading the incoming files. So you do have to specify this location — Auto Loader just writes the schema there and works from it. (Fail-on-new-columns comes up again at this point on the slide, but we have already covered it, so I'll skip it.) A sketch of inference with a schema location follows.
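Here is the inference route as a sketch, again with placeholder paths — no explicit schema, just the inference option plus the schema location where Auto Loader persists what it has inferred:

```python
# Schema inference: Auto Loader infers the schema from the incoming files and stores/tracks it
# at cloudFiles.schemaLocation. Without inferColumnTypes, inferred columns default to strings.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaLocation", "/mnt/autoloader/inference/_schema")
        .load("/mnt/landing/events/")
)
```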
Schema hints are also something you can use. How do they work? Suppose you do not know 100 percent of your schema, but you want to make sure that one particular column — say event_time — is always a date. You can specify exactly that.
How do you specify it? Just like before: spark.readStream.format("cloudFiles"), the option for the file format you are reading, and then the option cloudFiles.schemaHints with the hint you want to provide. A sketch follows.
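Sketched out, it looks like this — event_time is just an example column name, and the rest of the schema still comes from inference:

```python
# Schema hints: let inference handle most columns, but pin specific ones to the type you want.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/autoloader/hints/_schema")
        .option("cloudFiles.schemaHints", "event_time DATE")   # force event_time to be read as a date
        .load("/mnt/landing/events/")
)
```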
The hint here says that the event_time column of the files I am reading should be of date type, while the rest still comes from the readSchema or from inference. So whenever you want to give your read stream an extra hint about specific columns, you can always do that — you could see that the DataFrame originally had event_time as a timestamp, and the hint is what pins it to the type you want. So now you have essentially seen how Auto Loader works, as a kind of overview, and at least you understand what it is and what kind of changes you need to make in order to implement it. All of these pieces — failing the stream, automatically evolving the schema, sending bad records to a particular column, storing the schema, inference, hints — we are going to cover separately in my upcoming videos. I will also show how to connect to the storage account, use cloudFiles to take incremental data from it, and pull data using the queue subscription; we will capture all of that in detail, but this video was meant to be more of an overview — an overview, but still a good understanding. To sum up: Auto Loader gives you very good scalability, because if you have lots of files you can leverage different cloud services — you can stay with directory listing mode, or you can use the queue service instead — and that is a very strong point in Auto Loader's favour. It is also cost-effective: if you use file notification mode to detect new incoming files, you do not need to list the directory, and when you skip the listing you reduce cost and speed up the process. And it is easy to use — there is nothing extra to do beyond .format("cloudFiles") and the options
that you have to add. Apart from that, you do not have to track your files yourself, and you do not have to create a separate pipeline to maintain the checkpointing or anything of that sort. Similarly, there is schema inference and evolution: even if there are schema drifts, you can manage them — I have shown you how to handle inference and how to evolve the schema when a column is added or removed, and we will also see how to handle data type changes. Auto Loader handles all of that very well. So that is all about Auto Loader. I hope you liked this video — let me know in the comments if you have any doubts — and thank you so much for sticking around till here. Do remember to like, share, and subscribe to my channel.