PySpark Key-Value Transformations

Hello, and welcome back to my YouTube channel. This is Ranjan, and this is the seventh video of the Apache Spark playlist. In the last six videos I covered a basic introduction to PySpark, how PySpark works, and the architecture of PySpark, and in my last video I explained narrow and wide transformations. In this video we will talk about transformations on key-value pairs. In the last video we worked on a simple RDD without key-value pairs, just plain numbers.

So first of all, what are key-value pairs? Let me give you some context. Suppose this is our table: one column is the serial number and the other is the name, so the serial numbers are 1, 2, 3 and the names are Mike, John, Rambo. The serial number is the key. It is like an index, but where an index always runs 1, 2, 3, 4, a key can be any value we choose, just like in a dictionary. Suppose we label it age group, with entries like young, old, and medium: that would be our key, and this would be the value.

Now I will show you how we work with key-value pairs. I have initialized the Spark context. In the earlier video I was working with simple elements in an RDD, which is the simple case. In this example we have to work with key-value pairs, so I create some tuples inside a list. As you can see, I am creating (1, 2), so the first element is my key.
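The list of tuples described above can be sketched in plain Python, without Spark. The name `pairs` is an illustrative assumption, and the four tuples are the ones that appear later in the video; with a Spark context available, the list would presumably be passed to `sc.parallelize` to build the RDD the video calls data.

```python
# The four key-value pairs used in the video, as a plain Python list.
# Assumption: with Spark, sc.parallelize(pairs) would turn this into an RDD.
pairs = [(1, 2), (3, 4), (3, 6), (3, 4)]

# In each tuple the first element is the key and the second is the value.
for key, value in pairs:
    print(f"key={key} value={value}")

keys = [key for key, _ in pairs]
values = [value for _, value in pairs]
```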
Here as well, the first element is the key and the second element is the value. When I do data.collect, it gives me the tuples inside a list, and if I check the type of data, it is the same as before: an RDD. Now I count how many elements I have. If you count the individual numbers there are 8, but count gives me 4, because the RDD holds 4 key-value pairs.

Next I do countByValue. If you remember, earlier we just did count, but countByValue counts how many times each whole element, here each key-value tuple, occurs. (1, 2) occurs only once, (3, 4) occurs twice, so it gives me 2, and (3, 6) gives me 1. So the pair with key 3 and value 4 occurs two times.

Now I will show you a string example, which will make this clearer. Here the keys are 1, 2, 3, and 4, so you can think of the key as an index, and as values I am taking names. When I collect the string data I see this structure, and count again gives me 4. If I do countByValue here, it gives me 1 for Mike and 1 for every other pair, because every pair is unique.

Going back to data, the numerical RDD: if I call top with 2, it gives me the top two elements. top ranks the pairs from largest to smallest, so this one comes in first position, this one second, this one third, and this one fourth, and top returns the first two of that ranking. If I do sortByKey, the keys are 1, 3, 3, 3, so it sorts the pairs by key, like this. And if I want to look something up, lookup behaves like an index.
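As a rough illustration of the counting and sorting described above, here is a plain-Python model that needs no Spark. The variable names are assumptions for this sketch; the comments name the PySpark method each line imitates.

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6), (3, 4)]

# count(): number of elements, where each element is one whole tuple.
n_pairs = len(pairs)

# countByValue(): occurrences of each distinct element (here, each whole tuple).
value_counts = Counter(pairs)

# sortByKey(): order the pairs by their first element.
# sorted() is stable, so pairs sharing a key keep their input order in this model.
sorted_by_key = sorted(pairs, key=lambda kv: kv[0])

print(n_pairs)             # 4
print(dict(value_counts))  # {(1, 2): 1, (3, 4): 2, (3, 6): 1}
print(sorted_by_key)       # [(1, 2), (3, 4), (3, 6), (3, 4)]
```

Note that the count is per tuple, not per value: (3, 4) and (3, 6) share a key but are counted separately.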
If I want to see what is stored under key 3, I look it up. Which pairs have key 3? This one, this one, and this one, so there are three values under it: 4, 6, 4. So lookup gives me 4, 6, 4. If I want all the keys, I do data.keys and then collect, and it gives me all the keys. If I am interested in the values, I do data.values and then collect.

Next is mapValues. In my last video we covered map; here it is mapValues, because in a key-value dataset we want to work on the values. It behaves the same way map did on plain elements: it applies the function to each value in the RDD and leaves the key alone. What the function does here is multiply each value by itself. The first value is 2, so it is multiplied 2 by 2 and the result is stored as the new value: earlier my value was 2, now it is 4.

So with key-value pairs, instead of data.map we usually use mapValues, and instead of data.reduce we use reduceByKey. Here we have two functions, reduceByKey and groupByKey. For an aggregation both lead to the same final result, with one difference: groupByKey takes more computation resources. Let me explain the difference.

These are the three partitions of my RDD: this one, this one, and this one. What I want is one combined result for each key, and the purpose of both functions is the same. Here we have keys A and B, here also A and B, and here also A and B.
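The lookup, keys, values, and mapValues steps described above can be modeled in a few lines of plain Python. This is only a sketch: `lookup` and `map_values` are hypothetical helper names written for this example, not the Spark implementation, and the squaring lambda is inferred from "multiply each value by itself".

```python
pairs = [(1, 2), (3, 4), (3, 6), (3, 4)]

def lookup(pairs, key):
    """Model of rdd.lookup(key): every value stored under the given key."""
    return [v for k, v in pairs if k == key]

def map_values(pairs, func):
    """Model of rdd.mapValues(func): transform each value, keep its key."""
    return [(k, func(v)) for k, v in pairs]

values_for_3 = lookup(pairs, 3)      # [4, 6, 4]
all_keys = [k for k, _ in pairs]     # like data.keys().collect()
all_values = [v for _, v in pairs]   # like data.values().collect()
squared = map_values(pairs, lambda x: x * x)

print(values_for_3)
print(all_keys, all_values)
print(squared)  # [(1, 4), (3, 16), (3, 36), (3, 16)]
```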
In reduceByKey, I first combine within each partition. In the first partition I have (a, 1) and (b, 1). Is there a second a here? No, there is a single a, so I write it down as it is. In the second partition there are two a's, so I combine their values: 1 plus 1 becomes 2, and I write (a, 2). The same goes for b: 1 plus 1 is 2, so (b, 2). In the third partition I have three a's and two b's, so I write (a, 3) and (b, 2).

Now I have three partial results, and I combine them. First a: 1 from the first partition, 2 from the second, 3 from the third, and 3 plus 2 plus 1 is 6, so my final result for a is 6. Then b: 1 from the first partition and 2 from the second make 3, and 3 plus 2 from the third makes 5. So the final result is (a, 6) and (b, 5).

Now I will show you groupByKey. What it does is bring the pairs from all three partitions together. It does not aggregate here; it does not compute a partial result inside each partition.
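The two-step reduceByKey walkthrough above can be sketched in plain Python. The three lists stand in for the three partitions in the diagram, and `combine` is a hypothetical helper written for this example; real Spark performs the same idea (combine locally, then merge) with its own machinery.

```python
# Three partitions of one pair RDD, following the walkthrough above.
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1)],
]

def combine(pairs, func):
    """Fold the values of each key together with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

def add(x, y):
    return x + y

# Step 1: combine inside each partition (no data leaves the partition yet).
partials = [combine(part, add) for part in partitions]

# Step 2: merge the small per-partition results into the final answer.
merged = combine(
    [pair for partial in partials for pair in partial.items()], add
)

print(partials)  # [{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 2}]
print(merged)    # {'a': 6, 'b': 5}
```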
It just brings all the pairs from every partition together and calculates only in the final stage. So a lot of unnecessary data is transferred over the network, which costs bandwidth and uses more computation resources. All the values arrive here: the one (a, 1) element from the first partition, the two from the second, and the three from the third, and the final result is again (a, 6) and (b, 5). So both functions give the same result, but they get there in different ways, which is why reduceByKey is the recommended one for aggregations.

Now I will show you the syntax. This was my data.collect, and these are my values. Here I am reducing by key, so it works the way reduce did on simple elements in the previous video, except that it combines the values per key. For key 1 there is only one value. For key 3 it looks at the values, which are 4, 6, and 4, and combines them. The function takes two values, x and y, and returns x plus y; the key stays as it is. So I see (1, 2) and (3, 14).

In my other example I take the maximum: whatever values a key has, which one is the largest? For key 3 I have 4, 6, and 4, so the maximum is 6, and 6 comes out here.

And this is groupByKey. It behaves like grouping did on simple elements: the groups come back as iterable objects rather than printed lists, so the output looks like this.
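To make the two points above concrete, here is a plain-Python sketch. The first part counts how many pairs would have to cross the network under each strategy, using the three partitions from the diagram; the second models reduceByKey with the sum and max functions. `reduce_by_key` is a hypothetical helper for this example, and the pair counts are a simplified model of Spark's shuffle, not a measurement.

```python
from functools import reduce

partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey: every pair is moved before any aggregation happens.
moved_by_group = sum(len(part) for part in partitions)

# reduceByKey: each partition sends at most one pair per key.
moved_by_reduce = sum(len({key for key, _ in part}) for part in partitions)

print(moved_by_group, moved_by_reduce)  # 11 6

pairs = [(1, 2), (3, 4), (3, 6), (3, 4)]

def reduce_by_key(pairs, func):
    """Model of rdd.reduceByKey(func): func merges two values of one key."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

# Both arguments of the function are values; the key is never passed in.
sums = reduce_by_key(pairs, lambda x, y: x + y)
maxima = reduce_by_key(pairs, max)

print(sums)    # [(1, 2), (3, 14)]
print(maxima)  # [(1, 2), (3, 6)]
```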
I will show you. I have just performed groupByKey, and you can see it creates a separate group for each key: the values for 1 and the values for 3. If I print them, for 1 I have 2, and for 3 I have 4, 6, 4. Now if I want to add these values, I do groupByKey and then apply a function to the grouped values with mapValues. You will see the result is the same, but the way it gets there is different. Here too I get (1, 2) and (3, 14); it has combined all the values. And here I find the maximum value, which again gives me the same thing as before.

So using reduceByKey alone is equivalent to groupByKey followed by mapValues. reduceByKey combines the values as it goes, just as reduce did for simple elements in my last video. The purpose of reduce is to combine: you can add, multiply, or take the maximum, as long as the order in which the values are combined does not change the result. If you want to perform the same operation with groupByKey, you have to use groupByKey and then mapValues, and there you can sum, multiply, or whatever you want.

Now I have the example of flatMapValues. It works like flatMap did, but since these are key-value pairs it works on the values. I have only four pairs, but flatMapValues can increase the length of the RDD, because each value can turn into zero, one, or many values. What I am doing here is keeping the key the same and expanding the value into the numbers up to x. So what is x? Take my first pair, (1, 2). The key stays 1, and the new values run up to x.
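The equivalence between reduceByKey and groupByKey followed by mapValues, described above, can be checked with a small plain-Python model. `group_by_key` and `map_values` are hypothetical helpers for this sketch. In PySpark itself the grouped values come back as an iterable object, so a call such as `mapValues(list)` is typically needed before printing them.

```python
pairs = [(1, 2), (3, 4), (3, 6), (3, 4)]

def group_by_key(pairs):
    """Model of rdd.groupByKey(): collect all values of each key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return list(groups.items())

def map_values(pairs, func):
    """Model of rdd.mapValues(func): transform each value, keep its key."""
    return [(key, func(value)) for key, value in pairs]

grouped = group_by_key(pairs)
group_sums = map_values(grouped, sum)    # same answer as reduceByKey with x + y
group_maxima = map_values(grouped, max)  # same answer as reduceByKey with max

print(grouped)       # [(1, [2]), (3, [4, 6, 4])]
print(group_sums)    # [(1, 2), (3, 14)]
print(group_maxima)  # [(1, 2), (3, 6)]
```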
My value was 2, and the range goes up to n minus 1, so for the first pair I get just (1, 1). For the second pair, (3, 4), the key is 3 and the values start from 1: (3, 1), then (3, 2), then (3, 3). It stops at 3 because it only goes up to n minus 1. Then for (3, 6) it gives (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), again stopping at n minus 1. And the same for the last (3, 4): (3, 1), (3, 2), (3, 3). So it works the same way flatMap did.

Here is data.collect; this was my data. Now I am showing you an example of subtractByKey, and this one is important. This is my simple data, data.collect, and this is data2.collect, where I am taking only one pair, (3, 9). I have taken two cases: subtracting data2 from data, and subtracting data from data2. The first one gives (1, 2). Why? Because key 3 is present in data2, so subtracting data2 from data removes every pair in data whose key is 3.
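The flatMapValues expansion described above can be reproduced with a short plain-Python model. The function passed in, `lambda x: range(1, x)`, is an assumption inferred from the described output (values from 1 up to n minus 1); `flat_map_values` is a hypothetical helper, not the Spark implementation.

```python
pairs = [(1, 2), (3, 4), (3, 6), (3, 4)]

def flat_map_values(pairs, func):
    """Model of rdd.flatMapValues(func): func returns several values per
    input value, and each one is paired with the original key."""
    return [(key, out) for key, value in pairs for out in func(value)]

# Assumed function: expand a value n into 1, 2, ..., n - 1.
expanded = flat_map_values(pairs, lambda x: range(1, x))

print(expanded)
print(len(pairs), len(expanded))  # 4 12
```

Four input pairs become twelve output pairs: one from (1, 2), three from each (3, 4), and five from (3, 6).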
I have key 3 here, so every pair with key 3 is removed, but there is no key 1 in data2, so the answer is (1, 2). When I do the opposite and subtract data from data2, data2 has only key 3, and key 3 exists in data, so that pair is removed and nothing else is left: the result is empty. So the result is different depending on whether I subtract data2 from data or data from data2.

Now I have join. The join family is used to perform inner join, outer join, left outer join, and right outer join, so I will show you. This was my data, and this is data2.collect. In data I have four elements with keys 1 and 3; in data2 I am taking keys 3 and 4. When I join data with data2, it searches for the common keys. Key 1 is in data but not in data2, which has only 3 and 4, so only key 3 is combined. For the first pair with key 3 it writes the key and pairs the values 4 and 9, giving (3, (4, 9)). Then it takes the next pair with key 3, whose value is 6, and writes (3, (6, 9)). The last one is again (3, (4, 9)), because data2 has only one element with key 3. If I swap them, with data2 outside and data inside, it works the same way but the order inside each value is reversed: where it was (4, 9) before, now I get (9, 4), (9, 6), and (9, 4).

Now this is right outer join and this is left outer join. In a right outer join, every key of one particular RDD must appear in the result, and that RDD is the one you pass in as the argument.
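Both operations described above, subtractByKey and the inner join, can be modeled in plain Python. `subtract_by_key` and `join` are hypothetical helpers for this sketch. The value stored under key 4 in data2 is not stated in the video, so 10 is a placeholder, and Spark does not promise the output order shown here.

```python
data = [(1, 2), (3, 4), (3, 6), (3, 4)]

def subtract_by_key(left, right):
    """Model of left.subtractByKey(right): drop pairs whose key occurs in right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]

def join(left, right):
    """Model of left.join(right): inner join on the key."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

only_three = [(3, 9)]
removed_from_data = subtract_by_key(data, only_three)
removed_from_other = subtract_by_key(only_three, data)
print(removed_from_data)   # [(1, 2)]
print(removed_from_other)  # []

# Placeholder: the value for key 4 is not given in the video.
data2 = [(3, 9), (4, 10)]
joined = join(data, data2)
joined_swapped = join(data2, data)
print(joined)          # [(3, (4, 9)), (3, (6, 9)), (3, (4, 9))]
print(joined_swapped)  # [(3, (9, 4)), (3, (9, 6)), (3, (9, 4))]
```

Keys 1 and 4 never appear in the joined output, because each exists on only one side.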
Here that RDD is data2, the one inside the parentheses. So right outer join gives preference to data2: every key present in data2 is kept. In data2 I have only keys 3 and 4, so it creates entries for 3, for 3, for 3, and for 4. It does not create one for 1, because preference goes to data2, the RDD passed into the outer join.

Now I swap them: I call it on data2 and pass data. In data I have keys 1 and 3, so it creates an entry for 1 and entries for 3, 3, 3, but not for 4, because whichever RDD you pass in here is the preferred one. For key 1 there is no value in data2, so it just writes None on that side, and the rest looks like this.

Then I have left outer join. It is equivalent to right outer join with the two RDDs in swapped positions. In the right outer join I was calling the function on data2 and passing data as the parameter; in the left outer join I pass data2 as the parameter and call the function on data. It is somewhat confusing, but if you try it by hand it becomes very easy. The pairs are the same; only the order inside each value changes.

So here I am performing right outer join both ways, and here left outer join both ways. As I have noted here, the join keeps every key of one RDD, and where the other RDD does not have that key, its side is None. And for this left outer join, called on data2, every key of data2 is kept.
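Because the two outer joins are easy to mix up, here is a plain-Python model of both. `left_outer_join` and `right_outer_join` are hypothetical helpers for this sketch, the value 10 under key 4 is a placeholder not given in the video, and Spark does not promise the output order shown.

```python
data = [(1, 2), (3, 4), (3, 6), (3, 4)]
data2 = [(3, 9), (4, 10)]  # placeholder value for key 4

def left_outer_join(left, right):
    """Model of left.leftOuterJoin(right): keep every key of left."""
    result = []
    for key, v in left:
        matches = [w for k, w in right if k == key]
        for w in matches or [None]:
            result.append((key, (v, w)))
    return result

def right_outer_join(left, right):
    """Model of left.rightOuterJoin(right): keep every key of right."""
    result = []
    for key, w in right:
        matches = [v for k, v in left if k == key]
        for v in matches or [None]:
            result.append((key, (v, w)))
    return result

right_a = right_outer_join(data, data2)  # like data.rightOuterJoin(data2)
right_b = right_outer_join(data2, data)  # like data2.rightOuterJoin(data)
left_a = left_outer_join(data, data2)    # like data.leftOuterJoin(data2)

print(right_a)  # [(3, (4, 9)), (3, (6, 9)), (3, (4, 9)), (4, (None, 10))]
print(right_b)  # [(1, (None, 2)), (3, (9, 4)), (3, (9, 6)), (3, (9, 4))]
print(left_a)   # [(1, (2, None)), (3, (4, 9)), (3, (6, 9)), (3, (4, 9))]
```

Swapping the two halves of each value in `left_a` gives exactly `right_b`, which is the equivalence described above.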
If a key is not present in the other RDD, that side is None.

So we have covered transformations on key-value pairs as well. In the next video I will show you the cases where we don't use RDDs and use a SQL DataFrame instead. It works as a replacement for RDDs, because RDDs have some drawbacks, and in those cases we prefer the SQL DataFrame. That will be a very interesting video.

I hope I was able to explain key-value pair transformations. If you have any doubt or difficulty, just let me know by posting a comment on the video. Please show some support by liking the video and subscribing to my channel, and don't forget to press the bell icon to get notifications of further videos in your inbox. See you all in the next video. Till then, goodbye, enjoy, and happy learning.
