Salting to Solve Data Skew

[Music] welcome back to another video in this video we are going to study about how to solve data skew using Saltine okay so we are going to understand salting in aggregation and join but before that let's try to understand what exactly sorting is so salting basically refers to adding Randomness to a key in order to be able to evenly distribute it right so let's try to understand it with an example so let's say you're trying to join this data set the data set over here with another data set and let for a moment forget the other data set that we are joining to let's assume that this is the key on which the join is going to happen so there is a column called value and the join is going to happen on this column if we have a look at how value looks like we basically see that it contains a lot of ones right and we assume that the number of one that it contained is somewhere around 1 million the number of twos is somewhere around 5 and the number of Threes is somewhere around six right so this is this is the kind of value that you have in your data set now if we were to join this data set with another data set there is going to be a very simple logic which is going to decide vishiro is going to go to which Shuffle partition right so this row is going to which Shuffle partition and the logic is there is going to be a hash of whatever value you have mod the number of Shuffle partition right and we assume that the number of Shuffle partition is going to be three so this hash of your value is again going to give you an integer and that is going to be mod 3 which is again going to give you 0 1 and 2 which is three of the partition right it is going to go to either one of those partition any of these values right show the distribution that you're going to have is going to look something like this now so all of the ones are going to go to the same partition so you're going to have something like this one is going to look like this it is going to be 1 million 2 is going to be somewhere over here like this the values are going to be very small these are all the tools and then you also going to have all the threes which is again going to be very small so these are three of the partition in which your data is going to go to right now we see a problem here that this partition is very large and hence cute so it is going to take a very long time to process this now how do we resolve this in order to resolve this we are going to apply salting so the first step is that we choose a salt number and the salt number basically indicates how much we want to distribute the data right so the number that we are choosing over here is three so what I'm simply going to do is that I'm going to assign random numbers so I'm going to create a new column called salt and I'm going to assign random numbers between 0 and 3. 0 inclusive three non-inclusive to each of the rows so it is going to look something like this 0 to 1 1 0 and then so on for all of this right this is how it is going to look like so now what we are going to see is that instead of doing the join on value independently let's now do the join on value and salt so this is going to be the change so this becomes the new join key so when this becomes the new join key the rule which decide which row is going to go to which partition also changes so the new rule becomes something like this it becomes hash of value comma sold mod number of Shuffle partition right now what is the advantage of this new rule so it is going to help us divide the ones further into three partition so earlier what was happening was that the hash of 1 mod Shuffle partition this is always going to be the same value right let's say hash of 1 gives one for example one mod 3 is always going to give you one so this was the reason all of the ones were going into the same partition as shown over here now what is happening is that hash we are going to have three different hashes for the value 1. we are going to have hash one comma 0 hash of 1 comma 1 h of 1 comma two and all of this is going to be mod 3. yeah so these are now going to give us three different values and now let's assume that they give us 0 1 and 2. so this is going to again result in 0 1 and 2 right so this basically shows us that by using this rule the new rule which happened as a result of the change in the joint condition right we are able to distribute the ones more effectively into three partitions as of now so now the distribution is going to look somewhat like this is going to have somewhat evenly distributed data and this is going to be around 330k and this is going to be all your wants this is going they're going to be some tools and there are going to be some so what we see over here is that your data has got distributed between partition right somewhat evenly right so this is what is the aim of salting and this is the concept behind sorting we'll have a look at how to apply salting in practice on skew dataset so let's assume this is the first data set and as we've seen earlier this data set is cute because it contains a lot of ones and all of those one go into one single partition close to one million ones right and the second data set that you have is a uniformly distributed data set because you see that partition 0 has three values partition one has three values and partition two has four values more or less distrib uniformly distributed right so the first step that we do is we choose a salt number and this salt number is very important this is going to decide how your data is going to be distributed how much your data is going to be distributed right and this number needs to be chosen very wisely because then if you choose a large number then you're going to end up with very tiny bits of data in each of the partition and you also wouldn't want to shoot a very small number because then all of your data would be jam-packed in single partitions right so what we do is that we choose a salt number and then we start assigning numbers between 0 to the Salt number zero inclusive three non-inclusive that is the salt number not included and we call that column the salt right so we see here that each row is assigned a random number between 0 to 3. so this has been assigned 0 this has been assigned two and so on right so this is how we do for the first step now let's come to the second step in the second step what we do is that we create an array which is going to contain all the values from 0 until salt number -1 which is this one so the salt number is 3 so we are going to have 0 1 and 2 as the array over here right so now what we are going to do is that we are going to explode the array and this would simply mean that all of the numbers in the array get converted to single rows so this would simply mean that this would get converted to 1 0 1 1 and 1 2. the reason why we do this is because if we think logically there are two data frame and to the First Data frame we are assigning a random number right now in order to match it to the second data frame we must Ensure that the second data frame has that number right let's say we assigned 0 the salt number zero to one we must ensure that 0 is also present in the second data frame if we were to randomly assign any number to 1 then there would be high chances that when the join happens both the numbers wouldn't match right the salt for one in the First Data frame won't match with the salt for one in the second data frame so that is why we assign all possible values so that the join can happen so now we do the explode and we see that one has been converted to three rules one has zero one and two and then 2 also has converted has been converted to three rules which is 0 1 and 2 right and similarly we do it for all of the rows over here okay so now the third and the most important step is that we are going to do a join on the value and the salt right so earlier the join was only on value now the join is going to happen on the value and the salt right so this is going to ensure that the key that was unevenly distributed gets evenly distributed so earlier we were having a hash of value which means the hash of 1 mod 3 3 were the shuffle partition so this is always going to one right it was always going to one partition now what we have is because the join key has changed because it has become value and solved it is going to be hash of 1 comma 0 mod 3 this is let's say going to give you 0 again 1 comma 1 1 comma 2 and these are going to give you two different numbers right so let's see this gives you 1 and this gives you two so that means one is going to get distributed in three partition and that is what we see over here it gets distributed in three different partition so this is how salting is going to help you distribute your data sets evenly or almost evenly right a very important point to note is that we are exploding the data set over here right and explode is a very costly operation so if you are joining to data set and one happens to be smaller it is generally recommended to do the explode on the smaller data set if the data sets are almost the same size then you have no choice you would have to do on either one of them right so that's all about the concept for salting okay so let's see salting in action so here I am importing a few packages also trying to set up the spark session so here first we are going to do a skewed join a join on a secure data set and then we are going to apply salting to see the Improvement so this is a uniformly distributed data frame the first one and just to make sure that it is indeed uniformly distributed what I'm doing is that I am finding the spark partition ID per row so this function f dot part partition ID which basically comes from Pi spark SQL function it gives me the ID per row and what I'm doing is that I'm simply grouping that partition number and doing a count so here we simply see that most of the partitions are uniformly distributed they are 12 partitions all of them almost contain the same amount of data now this data set is not uniformly distributed and the way we've created it is the First Data frame it contains zeros 999 K times and all of this has been put into one partition while the other numbers they are very small in number one is 15 times the other one is 10 times and so on so again we would like we would look at the distribution of the data in the partition then we see that the zeroth partition contains the most data right now we simply do a join on the value and let's have a look at how the distribution now looks like after the join so you see that the data is secured the zeroth part in Partition contains the most rules while the first one just contains 15 rows now if we were to apply salting on this let's have a look on how it would improve so in order to take the short number I'm just taking it from Shuffle partition and the shuffle partitions I've just set it to three so the salt number is going to be three and as discussed in one of the data frames we are first going to assign a random number between 0 and Shuffle partition partition which is the sort number and the salt number is not included so this is how it's going to look like so you would see that 0 is being assigned to 2 0 is assigned to 0 or 0 again as a result of 0 and so on right now for the second data set what we said that we are going to have an array of numbers between 0 to Salt number minus 1 and then we are going to explode it right so let's see how that looks like so 0 basically had an array which would look like 0 1 and 2 and has been exploded into three rows and this is what those three rows look like so 0 has a salt of 0 1 and 2 similarly for the others right now what we do is we do the join on the value and the salt and now let's see how the distribution looks like on the value and the partition ID okay so now what we see that the value 0 has been uniformly distributed on three partition which is 0 1 and 2. you see somewhere around 332 to 333 K being uniformly put in the first three partition and all the other ones are more or less the same right so this is how salting helps you salting can help skewed joins be uniformly distributed let's understand salting in aggregations and for this we would take the same data set that we took earlier so this data set has one million zeros and then a few ones and twos and threes right and the objective that we have over here is simply to count the number of values so what we simply want is something like this so we want the number of zeros that is one million and similarly we want the count of all the other numbers right now the most easiest way to do this is simply by doing a group by Group by whatever value this and let's assume that the value the column name is called value we do a Group by value and then we do a count so if we do this we are simply going to achieve this right and let's try to have a look at how it is going to look at inside the partition right so when we do a group by a shuffle is going to happen right and when Shuffle happens the same key goes to the same partition and it follows some logic right and the logic is going to be something like this hash of value hash of whatever the column name is mod Shuffle partitions right and the shuffle partition the number of shop rule partition that we have over here is four yeah so when we do a hash of these values over here all of these values over here we are going to get some number so let's say we do a hash of zero mod 4 this hash of 0 is going to give me some integer right and mod 4 is again going to give me some integer right let's assume that this is always going to be 0 this is always going to be some number right now all of these values are always going to yield that same number right so whenever you do a hash of 0 mod 4 for this Row for this Row for this row and all the other rows it is always going to give you the same number and that is the reason why all of the zeros is going to go to this partition right similarly all of the one is going to go to this partition to the going to this partition and threes are going to this partition now what you would observe is that this partition partition 0 contains several ones it contain one million ones right and to aggregate to to basically count the number of values the number of zeros in this partition is going to take a long time right because and these partitions are going to finish processing earlier so the data at this partition is cubed okay so let's try to see how are we going to apply salting in order to solve this problem so in order to apply salting the first step that we do is we use a salt number right so we choose a salt number and this is going to decide how we want to distribute our data right so we choose the salt number let's say to be 4 and this basically decides how much we want to distribute our data right so what we are going to do is that we are going to assign a random salt to each of the rows over here and let's say we are going to do something like this 0 1 2 3 0 1 and so on right so we assign random salt random numbers to each of the rows and what this is going to ensure is that it is going to distribute your data more evenly across the partition so now what I'm going to do is that in order to achieve what we wanted we are first going to do a group by again a group by the value this is the value and the salt Group by value and the salt and then we are going to do a count but hang on you're going to tell me that this is not going to give me the answer right but it is going to give us some Advantage some Edge over the previous method and let's see what is going to happen so when we do a group by over here again it is going to incur a shuffle and the shuffle over here is going to be based on this logic so the logic that is going to be followed is going to be hash of value comma salt mod whatever Shuffle partition it had right Shuffle partitions so earlier what was happening was that hash of 0 hash of 0 mod Shuffle partition which was four this rule was being applied to all of the zeros and that is why all of the zeros were going to one partition now what is going to happen is that hash of 0 is being combined with the salt and that salt can be any number between 0 1 2 and 3 right it is going to be 0 1 hash of 0 2 h of 0 3 and H of 0 0 right so this is going to give me four different values right mod 4 this is going to give me four different values this simply ensures that 0 that was be going to one partition earlier now it goes to four different partition and let's see how it's going to look like right so this was earlier going to one partition now what is going to happen it is going to get divided into four part and then it is going to go to four different partition and this is how it would look like in the group by step so you see that around 250 K value stay in the 08 partition around 250k values so here I've tried to show equally distributed but it may or may not be exactly equally distributed so here you see that sum of the values go to the third partition and then the fourth partition so here you have rows with the salt 0 here you have wrote with the salt one with salt two and with salt three right again this is ensured by this number this rule over here that you get right so this is going to decide that okay this is going to a particular partition this is going to another partition and similarly for these two as well right so now we have done the first step over here the group by value this is done and that is what we see over here we we see the result over here now we want to do a count so what is going to happen is that this is simply going to give you the count so 0 comma 0 is going to have a count of 250k similarly the keys that you have because it is grouped by value and salt you are going to have two keys over here one comma 1 and 0 comma one similarly for this one you're going to have this is going to give you one count this is going to give you one count so the count for this 0 comma 2 is 250k 2 comma 2 is 0 and similarly for this right so now we are done till this step these two steps are done right now what is going to be the next step so the next step is basically to do a group by again on the value right we do a group by again on the value so that all of all of the groups that we for that we collapsed over here so what we see over here that the groups have been collapsed earlier all of the zeros were in one partition now what we've done is that we've divided it into four partitions and we've now collapsed it into one row right we divided it into four partitions over here and then we collapsed it into one single group in each of the partitions right so now once that is done what we simply want to do is that we want to achieve what our original question was right so we simply do a group by value and then we do a some aggregation of f dot sum and then we do a count okay so Group by value is again going to incur a shuffle so what this is going to do is that it is going to bring all of the zeros on the same partition this comes to the same partition because right now the group by key is value it is this number so all of them are going to come to same partition and this is simply going to stay where it was over here this is going to stay where it was and similarly for this right now the final step is just to do a sum because you have already computed the count in each of those single partitions so you you've reduced a bigger job into smaller jobs so the final value that you get over here is something like this so you do a sum and then you get the final value over here which is 1 million this stage as it is this stage as it is and this stays as it is yeah so this is how using salting in aggregation helps you distribute the data more evenly and you would see the difference in larger data sets wherein the data one of the partition is queued one or more partition is queued and then Distributing it this way makes the computation on the course more faster yeah now let's have a look at some of the code what we are trying to do is over here let's first do a count without sorting and then what we do over here is we simply create a salt column and we do a group by by the value and the salt and we do a count right this shows the first step that we did in the diagram now as a last step we do a group by on the value we should incur the shuffle again on the smaller already summed up counts right on on the count that you've already gotten from here and you do a final sum over here so that gives you the final value I hope all of that makes sense and you're able to understand salting in a better way it's great to see that you've completed the full video if you found value don't forget to like share and subscribe to my YouTube channel thank you again for watching

Transcript for:Salting to Solve Data Skew

Transcript for:
Salting to Solve Data Skew