Hi everyone, welcome back! In this video we are going to talk about partitioning, a very important optimization technique in Spark. But before we talk about partitioning, let's first talk about a bookshelf.

Imagine you had a collection of books and you wanted to organize them on a bookshelf, similar to what you see over here. If you were to put all the books on the shelf without maintaining any particular order, you would have a very hard time finding a particular book: you would essentially end up scanning each and every book until you found the one you wanted. That is a very time-consuming and inefficient approach. What you could do instead is divide your books into sections, based on the author, the color of the spine, or anything else. Once you've divided them into sections, you've reduced the search space: you go straight to the right section and pick up your book. What you've essentially done is partition your entire bookshelf into sections.

In Apache Spark, partitioning is similar to organizing your books on a bookshelf: it means dividing your large dataset into smaller, more manageable chunks. I like to show this image because it makes partitioning easy to understand. On the left-hand side is a cluttered, unorganized bookshelf in which it is very difficult to find a book; on the right-hand side is a neatly partitioned bookshelf with clearly divided sections in which you can easily find a particular book.

Okay, so now that we know what partitioning is at a high level, let's dive into the code and see which function we use to partition a dataset and what a partitioned dataset looks like on disk. This is the notebook I've prepared. In the initial cells I set up the Spark session and load a mock Spotify listening-activity dataset: it contains songs, when each song was listened to, and for how long. So what you see over here is how the dataset looks: a song, the time at which it was listened to, and the duration in seconds.

First I want to do a little bit of formatting. I rename the original listen date column to listen_time, and then I extract the date out of that column, because I want to create sections, partitions, where all of the rows belonging to a particular date land in the same partition. So if a song was listened to on the 27th of June, every record from the 27th of June should go into that partition. Before I can do that I need to create that column, and that is what we do over here: I create the listen_date column using the to_date function from pyspark.sql.functions, and I specify the format, which is the timestamp format you see in the data.
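To make the walkthrough concrete, here is a minimal sketch of that setup. The file path, column names, and date format below are my assumptions based on the narration, not the exact notebook code, so adjust them to your own dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Hypothetical path to the mock Spotify listening-activity CSV.
df = spark.read.csv("data/raw/spotify_listening.csv", header=True, inferSchema=True)

# Rename the raw timestamp column, then derive a pure date column to partition on.
# The "yyyy-MM-dd HH:mm:ss" format is an assumption about the source data.
df = (
    df.withColumnRenamed("listen_date", "listen_time")
      .withColumn("listen_date", to_date(col("listen_time"), "yyyy-MM-dd HH:mm:ss"))
)
df.show(5)
```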
Now let's see what this looks like. I've created the date column, so let's go ahead and partition the dataset. I'm going to partition by listen_date, and the way to do that is the partitionBy function: you call partitionBy with the column you want to partition on, listen_date in this case, and then provide the output path, which is listening_activity_partitioned. Let's run this and have a look at how the dataset appears on disk. If I open the output directory, you see separate folders, separate partitions or sections, created based on listen_date, and each folder contains all of the rows for that date. This is essentially how you partition a dataset.
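A hedged sketch of that write, continuing from the earlier setup (the output path and the Parquet format are assumptions; the narration does not name the file format):

```python
# Write the DataFrame partitioned by listen_date: Spark creates one
# sub-folder per distinct date, e.g. listen_date=2024-06-27/part-00000-...
(
    df.write
      .mode("overwrite")
      .partitionBy("listen_date")
      .parquet("data/listening_activity_partitioned")
)
```

Each `listen_date=<value>` folder holds only the rows for that date, mirroring the sections of the bookshelf from the analogy.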
Now let's move on to the second problem that partitioning solves: parallelism and resource utilization. Resource utilization basically asks: of the resources you have, the CPU cores and the memory, what percentage are you actually using? Are you utilizing 100% of them, or are most of them simply sitting idle, say 20% utilized and 80% idle? Achieving good resource utilization is very important because it ensures that you're actually using your cluster.

We do know for a fact that an executor can contain one or many cores, and that a partition is taken up by a core and then processed by it. So let's take an example: say you have three executors with one core each for simplicity, so core one, core two, and core three, and you have five partitions. When these partitions are processed, they are simply handed out to the cores: three go first, and when those are completed, the remaining two go to whichever cores free up. This shows that all of your cores are being utilized.

Now consider a case with a smaller number of partitions, say one large partition. If I want to process that partition, even if all three cores are empty, I can only give it to one of them. The problem is that the other two cores sit idle while this one core does all the heavy lifting: it has to work through the big partition alone, and the job takes a long time. Had the data been divided across the three cores, the job would have finished in far less time, but because you have one large partition it takes a long time to complete. That is not a scenario we want to be in.

Another case: we see a very large partition and divide it into a lot of partitions, say 50 very small ones, which are again distributed across the cores. You've increased the parallelism, which is good, but now you end up with a lot of small files, and the small-file problem is another well-known problem in Spark, where much of the time is lost in I/O operations. So the key lies in maintaining an optimal number of partitions; that is what gives you good parallelism and resource utilization. These, I believe, are the two key problems that partitioning solves.

Okay, so now a very important question many of us may have: how do I choose which column to partition by? There can be several columns in a dataset; for example, as you see over here, there are six columns in mine. How do I know I should partition by listen_date rather than any other column? Which is the most appropriate? A few important points should always be taken into consideration when choosing a partition column.

The first is cardinality: a column can be a high-cardinality or a low-cardinality column, where by cardinality I mean the number of unique values in that column. Let's take the example of an e-commerce transaction dataset. A customer table has a column called customer_id, and this column has very high cardinality because every customer has a unique ID, so all of the rows in the customer table have a distinct customer_id; even in a transactions table, the number of unique values in the customer_id column is large. If you partition by this column you end up with an enormous number of partitions, and that doesn't help Spark, for the simple reason that it would still end up scanning a huge number of partitions, essentially one per customer ID in your whole dataset. It doesn't help Spark eliminate or reduce the search space.

Now, coming to low-cardinality columns: in the same e-commerce dataset a low-cardinality column could be, for example, the state from which a customer is ordering. Each state covers far more rows than a single customer_id would, so this is a good column to partition on: you get a good chunk of rows inside each state partition, and Spark can select the partition for a particular state based on the filter criteria you specify. So low-to-medium-cardinality columns are good choices. That said, keep in mind that we don't want super-low cardinality either: a column with a cardinality of one gives you only one partition holding the whole dataset, and Spark would just scan that one partition, which doesn't help at all.

The second criterion is your filter pattern: if you very frequently filter by a particular column, it's very important to partition by that column, because then Spark can specifically select the matching partitions and your querying becomes faster. These two criteria are the most important to keep in mind whenever you select a partition column.
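To see why the filter criterion matters, here is a hedged illustration of partition pruning, reusing the assumed paths from the earlier sketches (the date value is hypothetical):

```python
# Because the data was written partitioned by listen_date, a filter on that
# column lets Spark read only the matching sub-folder instead of scanning
# the whole dataset -- this is partition pruning.
pruned = (
    spark.read.parquet("data/listening_activity_partitioned")
         .filter("listen_date = '2024-06-27'")  # hypothetical date
)

# The physical plan should show a PartitionFilters entry on listen_date,
# confirming that non-matching folders are skipped at read time.
pruned.explain()
```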
Now that we know the concept of partitioning, let's play around with it a little. So far I've shown you single-level partitioning: the output above was created with a single partition column, listen_date. Now let's say you wanted to partition by listen_date and listen_hour as well. What you can simply do is pass both columns to partitionBy: listen_date first, then listen_hour. Let's run this; I've saved it as partition two. Another output is created, and you see the listen_date folders, and inside each of them the listen_hour folders, for example listen_hour=10. So it creates two levels of partitions, multi-level partitioning, based on the columns we specified inside partitionBy.

It's very important to be careful about the order here, because Spark creates the folders, those partitions, in exactly that order. Let's reverse the order and see what happens: first listen_hour, then listen_date, saved as partition three. In the new output you now see listen_hour at the top level (we happen to have just one listen hour here), and inside of that you see the listen_date folders. So just be careful with the order.

Now let's say you wanted to control the number of files inside every partition. For example, inside this listen_date partition for the 25th of April there is one data file, part-00000 (the other one is the CRC file). Say you wanted to have not one but three files in there. To solve that, we do a simple repartition before the partitionBy: we call repartition(3) and then partitionBy on listen_date, and I rename the output to partition four. Let me run this... okay, it's done, and now you see part 0, part 1, and part 2 (the other files are CRC files). So you are now controlling the number of files inside your listen_date partition, and you do it using the repartition function. Let me try this again with another value, repartition(6), and see how it behaves... this takes a little bit of time... done, and now six files have been created as a result of repartition(6).

Now let's do a similar operation with coalesce(3) and see how it behaves; we would ideally expect three files inside every partition. I name this partition five. Let's run it and... okay, you just see one file. Let me put 6 instead of 3 and rerun... there's no effect, you still see one file. The reason is that coalesce wants to avoid a full shuffle, so it adjusts to whatever the existing number of partitions is: the data started out as a single partition, so it stays as one, and you get one file per partition folder. So this is how you can control the number of partitions, really the number of files, inside a partitionBy output.
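A hedged sketch of those two experiments, with illustrative output paths (the coalesce behavior described assumes the DataFrame starts as a single partition, as in this walkthrough):

```python
# repartition(3) performs a full shuffle into 3 partitions before the write,
# so each listen_date folder can contain up to 3 part files.
(
    df.repartition(3)
      .write.mode("overwrite")
      .partitionBy("listen_date")
      .parquet("data/listening_activity_partition4")
)

# coalesce(3) avoids a full shuffle: it can only merge existing partitions,
# never split them. Starting from a single partition, the DataFrame stays at
# one partition, so each folder still gets just one part file.
(
    df.coalesce(3)
      .write.mode("overwrite")
      .partitionBy("listen_date")
      .parquet("data/listening_activity_partition5")
)
```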
Okay, so now we are going to look at an interesting Spark property related to partition sizes at read time. The property is called spark.sql.files.maxPartitionBytes, and it decides the maximum size of a partition that Spark will create when reading a file; Spark splits the file based on this value. For example, say you set it to 128 MB and your total file size is 512 MB: Spark reads the file in chunks of 128 MB, which ensures that you have four partitions in total. This is how you specify the partition sizes that get picked up at read time.

Now let's take a look at this code. I'm creating a Spark session again and reading the Spotify listening-activity CSV, and my objective is to find out how many partitions Spark created when reading the initial file. For that I call df.rdd.getNumPartitions(). Let me run this... and you see that it read the whole file as one partition.

Now what I want to do is load this file as multiple partitions, so I set maxPartitionBytes. Before we do that, let's first find the total size of the file: I cd into data/partitioning/raw and run du -h on the Spotify listening file, and it is 448 KB, so your total file size is 448 kilobytes. What I'm saying here specifically is that when I read the file, I want to split it into 1 KB partitions, so I set the property to 1000 bytes, which is roughly 1 kilobyte: read each partition in a size of 1 KB. The total size is 448 KB, so reading the file in chunks of 1 KB should give me about 448 partitions, and when I call getNumPartitions() again I should expect something around 448. Just be aware that this is a maximum partition size, so some partitions may be smaller than 1000 bytes and the number you get may not be exactly 448, but it should be close. Let's run it... you see 457, which is very close to 448. So this is how maxPartitionBytes helps you decide the number and sizes of partitions at read time.

Okay, that's all about partitioning. I hope all of this was comprehensive and gave you a clear idea of what partitioning is in Spark. It's great to see that you've completed the full video. If you found value, please don't forget to like, share, and subscribe. Thank you again for watching!
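For reference, here is the read-time experiment from that last section as a hedged, self-contained sketch (the file path is assumed; spark.sql.files.maxPartitionBytes is the real property name):

```python
from pyspark.sql import SparkSession

# Cap read-time partitions at ~1 KB so a ~448 KB file splits into ~448 chunks.
spark = (
    SparkSession.builder
    .appName("max-partition-bytes-demo")
    .config("spark.sql.files.maxPartitionBytes", "1000")
    .getOrCreate()
)

df = spark.read.csv("data/partitioning/raw/spotify_listening.csv", header=True)

# Expect roughly 448 partitions; the exact count can differ slightly because
# 1000 bytes is a maximum, not an exact chunk size.
print(df.rdd.getNumPartitions())
```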