Transcript for:
Spark RDD Transformations Overview

Hey, hello, welcome back to my YouTube channel. It's Ranjan, and this is the sixth video of the Apache Spark playlist. I have covered five videos already; if you search for "Ranjan PySpark" you will find them. In those videos I covered a basic introduction to Apache PySpark, how big data works, big data and Hadoop, MapReduce, and Pandas, and in the last video I covered one type of RDD operation, which was actions. In this video we will talk about transformations. So what is a transformation? A transformation is an operation which we apply on an RDD: its input is an RDD, and its output is also an RDD. So it is a function that takes an RDD as input and produces an RDD as output. You can say that it produces a new RDD from an existing RDD.

Now here's the catch: as we have already covered, transformations are lazy in nature, so Spark does not create the new RDD on the fly. Whenever you apply a transformation, Spark just checks the syntax; it will not generate the new RDD. The RDD is generated only when you apply an action on it.
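This lazy behavior can be loosely mimicked in plain Python with generator expressions; this is only an analogy to illustrate the idea, not the Spark API:

```python
# Rough plain-Python analogy for lazy transformations: generator expressions
# build a pipeline but compute nothing until something consumes them
# (the analogue of calling an action such as collect()).
data = [1, 2, 3, 4, 5]

doubled = (x * 2 for x in data)              # "transformation": nothing runs yet
evens = (x for x in doubled if x % 2 == 0)   # chained "transformation": still nothing

result = list(evens)                         # "action": the whole pipeline executes now
print(result)  # [2, 4, 6, 8, 10]
```

Just as in Spark, nothing is computed when the pipeline is defined; the work happens only at the final consuming step.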

To be more clear: whenever you apply a transformation on any RDD, Spark does not perform the operation immediately. It records the operation in a DAG, a directed acyclic graph, and it keeps building on that graph until you apply an action on the last transformed RDD. That's why transformations in Spark are lazy. Transformations are divided into two parts.

The first is narrow and the second is wide. This is a narrow transformation.

We also call a narrow transformation pipelining. Why do we say pipeline? Here, this is my RDD1; I apply some transformation to it, and then my RDD2 is created.

Now why is it narrow? Because the mapping is one-to-one. What is a one-to-one transformation? All the elements required to calculate the result live in a single partition. Suppose I am applying a filter operation to this partition; then I don't need the data of the other partition.

Because I just want to filter, I apply one filter to this partition's data and one filter to that partition's data. Each produces its own result, and at the end I just concatenate the results. There is no need to shuffle, because this data does not depend on that data. Suppose the data contains ages and I have to filter out the odd ages, say 21, 23, 25. I will find all the people with an odd age in this partition, and separately find the odd ages in that partition; neither search depends on the other. I will tell you later in which cases the partitions are dependent.

At that point it will be clear why we call this a narrow transformation. We can also say there is no shuffling of data, because we are not moving data between these partitions. That's why it is called pipelining: it works just like a pipeline.

We apply the operation on each node separately, because in an RDD the data is distributed among nodes: some data is stored on this node, some on that node, some on the third. Suppose I have 30 rows, with 10 rows stored on each node. To find everyone in the odd-age category, I just fetch the matching rows from each node directly. But in the case of sort, suppose I want to sort this data; first I have to join the data of both nodes.

So there we apply shuffling, and that process belongs to wide transformations. This is narrow transformation, and these are the examples: map, flatMap, mapPartitions, filter, sample, union.

I will be showing the code for these narrow transformations. You can say a narrow transformation uses a limited subset of the partitions: on each node I use only that node's subset of the whole data, apply the operation on that subset, and get the result.

Now we have wide transformations. This process is also called shuffling: the elements required for the calculation may live in the same partition or in different partitions. Here I am taking three nodes, and each node holds a different partition of the data. Suppose I have 30 rows.

I have an Excel file of 30 rows; 10 rows are stored on each of the three nodes, and I apply a wide transformation. The diagram is self-explanatory: after the transformation, a result partition is built from data coming from different nodes, so here we are applying shuffling. Why do we need shuffling? I will give you a scenario. These are the examples. First, suppose I have to find distinct values.

So I have some Excel dataset with a names column. Ten names are stored on each node, and I have to find the distinct, that is, unique, names. If I search each node separately, the same name could show up as "unique" on all three nodes, but I need each name only once overall. So first I shuffle the data to bring it together, then I apply the distinct function, and then each unique name appears only once. That is the need for shuffling. Next we can see intersection. Suppose we want to see the common data; we have all done intersection in math. This is set A and this is set B, two different datasets, or we can say two different partitions, and we just want to see what data is common to datasets A and B.

The common data is whatever is found in both datasets. To find that part of the dataset we have to apply shuffling, because I need to join these two datasets.

So these are the examples of wide transformations, and among them are groupByKey and reduceByKey. These two are used on RDDs with key-value pairs. A key-value pair behaves like a dictionary in Python: in a Python dictionary we have a key and a value.

So there is another type of RDD which we use as a key-value RDD; I will explain that concept in my next video. In this video we will see examples of the transformations, both narrow and wide. So let's start the Python implementation.

As I told you, these are the types: these are the narrow transformations and these are the wide transformations, and whichever transformations have "key" in the name are used on key-value RDDs. First we see the map example. So what is map? This is our RDD, and map is a function through which we can apply any operation to it.

Suppose we want to multiply each row by two. We just create a function and map that function over this RDD; once the function is mapped, a new RDD is created. Notice that the number of elements is three before applying the transformation and still three after, but the color has changed, meaning some property has changed. In the code, I have already initialized sc, my SparkContext (I covered what a SparkContext is in an earlier video; if you are not aware of it, watch that one). Here I am creating a new variable, num, which will be my RDD, initialized with some random data. Now I am mapping a lambda function onto this RDD; num is my RDD holding this data.

What am I doing? I am multiplying each number by 2. It is very basic, but I am just showing you how map performs. At the end I am using collect, so it gives me all the results, and you can see it has multiplied every element by 2.

Now I am giving you another example: it squares each element, giving values like 25 and 16. It's not necessary to use a lambda here; I can use any built-in function or user-defined function.
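The same idea can be sketched with Python's built-in map, as a rough analogue of the Spark API (the data values here are illustrative, not the exact numbers from the demo):

```python
# Plain-Python analogue of RDD.map: apply a function to every element,
# producing a new collection without modifying the original.
nums = [1, 2, 3, 4, 5]

doubled = list(map(lambda x: x * 2, nums))   # like rdd.map(lambda x: x * 2).collect()
squared = list(map(lambda x: x ** 2, nums))  # any function can be mapped, not just lambdas

print(doubled)  # [2, 4, 6, 8, 10]
print(squared)  # [1, 4, 9, 16, 25]
```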

Lambda is very easy; we can write a lambda function in just one line, so I am using that. I can apply map to strings as well. So here are four names.

If I want to prefix "Mr." to each name, I can do it like this: here I have used just map, and it has concatenated "Mr." to everyone.

Now we have the flatMap function. It is very similar to map, but here is the catch. What is the difference? With map the number of output elements is three, matching the input, but with flatMap the output can have any number of elements: each input item can be mapped to zero or more output items. From this single element I can create three elements, from that element three more, and so on; it depends on which function we apply. So with map the size and the number of partitions stay the same, but with flatMap they can differ.

Here is how it performs. This is the RDD that I have created, and first let me show you the range function: if I apply range(1, 3) and print every element, it shows me 1 and 2, because a Python range goes up to n minus 1. Now I apply flatMap with range(1, x). The first element of my RDD is 2, so range(1, 2) produces just 1. The second element is 3, so range(1, 3) produces 1 and 2. In the case of 4, range(1, 4) produces 1, 2, 3. Earlier my RDD was 2, 3, 4, and now it has been changed into 1, 1, 2, 1, 2, 3.

Another example: this is my new RDD a with 1, 2, 3. Now I apply flatMap so that for each element it first emits the element itself, then the element multiplied by 10, and then 57 as it is.
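Both flatMap examples can be sketched in plain Python with list comprehensions (an analogy only, not the Spark API), which also makes the map-versus-flatMap difference visible:

```python
# Plain-Python analogue of RDD.flatMap: each input element maps to zero or
# more output elements, and the results are flattened into one collection.
rdd_like = [2, 3, 4]

# map keeps one output per input (here each output is itself a list)...
mapped = [list(range(1, x)) for x in rdd_like]
print(mapped)       # [[1], [1, 2], [1, 2, 3]]

# ...while flatMap flattens those lists into a single sequence.
flat_mapped = [y for x in rdd_like for y in range(1, x)]
print(flat_mapped)  # [1, 1, 2, 1, 2, 3]

# The second example from the video: each element expands to three outputs.
a = [1, 2, 3]
expanded = [y for x in a for y in (x, x * 10, 57)]
print(expanded)     # [1, 10, 57, 2, 20, 57, 3, 30, 57]
```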

So the result would be 1, 10, 57, then 2, 20, 57. Let's see: the result is 1, 10, 57, 2, 20, 57, 3, 30, 57.

Now we have filter. Filter is the operation which gives us a new dataset by selecting the elements of the source that match some filter criteria, just like applying a filter in Excel. Suppose we want to search for odd values, even values, or multiples of something. In the diagram you can see that after I applied the filter, the middle partition disappeared and only two remain. I have been saying "RDDs" many times, but these are really the partitions of a single RDD: we had three partitions, and after applying the filter transformation there are two, because one partition had no data matching the filter criteria. The result of a filter operation contains only a subset of the information that was in the input dataset. In the code, this is my data, and now I am applying a filter operation: I pass a function that keeps only the even values, so it gives me the even results. I can do the same with strings: here I am filtering for whatever starts with "b", and it gives me "bells" and "brain".

Now we have union. From the name it is self-explanatory: it combines both datasets. First is my num variable, and the other is num2, which holds a different dataset. You will see that some data is repeated: 4 is in both, and 9 is in both. When I apply union, it just combines both datasets regardless of whether an element is repeated; if it is repeated, it will simply appear twice or thrice. Here I have combined num2 and num, and it has done just that. The contents do not depend on the operand order, only the arrangement does: if I put num2 first, its data is printed first and num's data second. This one here is the key-value-pair variant, which I will cover in the next slide.
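The filter and union behavior can be sketched in plain Python (an analogy only; the even-filter data and the exact names list are illustrative assumptions chosen to match the outputs mentioned above):

```python
# Plain-Python analogues of RDD.filter and RDD.union (not the Spark API).
nums = [1, 2, 3, 4, 5, 6]                     # illustrative data
evens = [x for x in nums if x % 2 == 0]       # like rdd.filter(lambda x: x % 2 == 0)
print(evens)    # [2, 4, 6]

names = ["bells", "brain", "mark", "mike"]    # assumed names matching the demo's output
b_names = [n for n in names if n.startswith("b")]
print(b_names)  # ['bells', 'brain']

# union simply concatenates both datasets; repeated elements are kept.
num = [1, 7, 9, 4, 10, 15]
num2 = [5, 5, 4, 3, 2, 9, 2]
combined = num + num2
print(combined)  # 4 and 9 each appear twice, because union keeps duplicates
```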

So sample is a type of narrow transformation which provides me a sample. For that sample we can define a fraction, and we can define whether to use replacement. Suppose I have the data 1, 2, 3, 4, 5, 6 and I have to extract a sample from it.

Say the first element I draw is 4. The question is what happens to 4 in the original data: if I put it back so it can be drawn again, that is sampling with replacement; if I delete it from the original data so it cannot be drawn again, that is sampling without replacement. So I have to define whether I want replacement or not.

Suppose the second element I draw is 5: again, with replacement the 5 stays available and could be drawn once more, while without replacement it is set aside. In the code, this is my data, parallel = sc.parallelize(range(1, 10)); these are my 9 elements.

And here I am passing True. True means it will be sampling with replacement, so the same element can appear more than once in the sample. The 0.2 is the fraction of the original dataset, so it gives me roughly 20% of the data. And when I run this again, the result will change.

The result will be different each run because the sampling is random. If I want the result to be the same, I can apply a seed. You can see I have given the value 19 as the seed; if I run this many times, the result stays the same because I seeded it with 19. I can use any value, but to get the same result the value must be the same every time.
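Seeded sampling can be sketched with Python's random module. The analogy is loose: Spark's sample takes a fraction (an approximate proportion of the data), while random.sample takes an exact count k, so this only illustrates replacement and seeding, not Spark's exact semantics:

```python
import random

# Plain-Python sketch of the ideas behind RDD.sample(withReplacement, fraction, seed).
data = list(range(1, 10))  # nine elements, 1 through 9

random.seed(19)  # fixing the seed makes the result reproducible, as in the video

# Without replacement: each element can be picked at most once.
without = random.sample(data, k=2)

# With replacement: the same element may be drawn more than once.
with_repl = [random.choice(data) for _ in range(2)]

print(without, with_repl)
```

Re-running with the same seed reproduces the same draws; changing or omitting the seed gives a different sample each time.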

So we have covered the narrow transformations; these are the narrow transformations I have covered. Now it's wide transformations. The first operation in wide transformations is groupBy.

I will tell you how it works; this explanation makes it very clear. This is my first RDD, in which I have four partitions. What I want to do is grouping: I just want to create separate groups. I can create groups on the basis of sex, male or female; on the basis of age; or on the basis of the alphabet. Suppose I have many employees and I want to group them by the first letter of their names, so I create a separate group for each letter, or I can create bins for the ages. I will show you in the example so it becomes clear. Here are four names, and I just want to group them by their first letter.
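The grouping idea can be sketched in plain Python with a dict (an analogy only, not the Spark API; the names are the ones used later in this demo):

```python
# Plain-Python sketch of rdd.groupBy(lambda name: name[0]):
# bucket each name under its first letter.
names = ["Bill", "Brain", "Mark", "Mike"]

groups = {}
for name in names:
    groups.setdefault(name[0], []).append(name)  # key = first letter

print(groups)  # {'B': ['Bill', 'Brain'], 'M': ['Mark', 'Mike']}
```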

So I will apply names.groupBy(); that is our function. There are two types of this operation among the wide transformations.

First is groupBy and second is groupByKey. groupBy we can use on a normal RDD in which we have only one type of element, but groupByKey is used on key-value pairs, which I will be covering in the next video. So here I am taking the example of simple groupBy, and in this groupBy I'm applying a lambda function.

I could apply any function. In the lambda I am just taking the first letter of each name, so it sorts the names into groups: if the letter is B it creates one group, and if it is M it creates a separate group. When I check, it has created two groups, one for B and one for M, but the values come back as a generator. To see what information is under B and what is under M, since it has created a structure somewhat like a key-value pair, I have to apply a for loop on it.

I take two variables, k and v. I can name them anything, but I need two variables so that I can see what is in the key and what is in the value. The first one behaves like the key.

So I will show you; I will print it. These are the names: names is the variable, it is the RDD, and in names I have four names. Here I am applying the function: I'm using groupBy, and in groupBy I'm using a lambda.

I could use any function there. What am I doing? I'm just taking the first letter of each name and grouping on that.

I am using collect, so it gives me all the groups, and you can see it has been converted into something like key-value pairs: this is the key and this is the value, and the value is a generator. To see the information inside, I have to convert it into a list, so I apply a loop with the key and value variables k and v (I could use any names), since it has created a key-value-like structure. When I run it, I get the information: B has two values, Bill and Brain, and M has two values, Mark and Mike.

Let me show you another example. This is my new variable a, an RDD holding the elements 1, 1, 2, 3, 5, 8. I will create groups based on what each element gives when divided by 2. Whenever I divide an element by 2, the remainder is either 1 or 0, that is, the element is odd or even, so it creates two groups, one for 1 and another for 0. If I look at the result, it is again like key-value pairs with the value as a generator, so I have to convert it into a list; I run the loop for k, v in result, and it prints this structure. Dividing these elements by 2, I get remainder 1 for the odd ones and 0 otherwise.

Now comes intersection. Intersection gives whatever elements are common.

Which elements exist in both RDDs? Here one RDD has 2, 4, 9, 7 and the other has 9, 1, 0, 4, so the common elements would be 4 and 9. I will show you how to apply it: this is my num and this is my num2, and first I place num here.

First I place num here and num2 inside the intersection call, and I check the result: it gives me 4 and 9, because those are the only two elements that appear in both datasets.

These are the common bits. Now I swap them, and the result is the same, 4 and 9; the order does not matter, whether I use num here or there.
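Intersection can be sketched in plain Python with sets (an analogy only, not the Spark API), using the num and num2 data from this part of the demo:

```python
# Plain-Python analogue of RDD.intersection: the elements present in both datasets.
num = [1, 7, 9, 4, 10, 15]
num2 = [5, 5, 4, 3, 2, 9, 2]

common = sorted(set(num) & set(num2))
print(common)  # [4, 9]

# The operand order does not matter, just as in the video.
assert set(num) & set(num2) == set(num2) & set(num)
```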

Similar to intersection, there is subtract as well, but subtract performs differently: I will tell you how. It subtracts num2 from num1. So num1 is this and num2 is this, and these elements are common to both.

So I will show you how subtract works in an RDD. These are my two RDDs, num and num2, and these are the elements which exist in both RDDs.

First is num minus num2: I subtract num2's elements from num. For every element in num, I check whether it also appears in num2, and I keep it only if it does not. Is 1 in num2? No, so I write 1. Is 7 there? No, so I write 7. Is 9 there? Yes, so I do not write it. 4 is there, so I do not write it. 10 is not there, so I write 10, and 15 is not there, so I write 15.

The same goes for num2 minus num. Is 5 in num? No, so I write 5, and the second 5 as well. 4 is there, so I do not write it. 3 is not there, so I write 3; 2 is not there, so I write 2. 9 is there, so I do not write it, and the final 2 is not there, so I write 2.

So num minus num2 is 1, 7, 10, 15, and num2 minus num is 5, 5, 3, 2, 2. That is how it works.
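The subtract walkthrough above can be sketched in plain Python (an analogy only, not the Spark API); note that duplicates on the left side survive, matching the 5, 5, 3, 2, 2 result:

```python
# Plain-Python analogue of RDD.subtract: keep the elements of the left dataset
# that do not appear anywhere in the right dataset (left-side duplicates survive).
num = [1, 7, 9, 4, 10, 15]
num2 = [5, 5, 4, 3, 2, 9, 2]

num_minus_num2 = [x for x in num if x not in num2]
num2_minus_num = [x for x in num2 if x not in num]

print(num_minus_num2)  # [1, 7, 10, 15]
print(num2_minus_num)  # [5, 5, 3, 2, 2]
```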

Now we have distinct. Distinct gives the unique elements, whatever unique elements exist in the RDD. My data is 5, 5, 4, 3, 2, 9, 2, and I apply distinct. It tells me the unique values are 5, 4, 3, 2, and 9; the repeated elements have been dropped. So now I have completed the wide transformations on simple RDDs.
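Distinct can be sketched in plain Python as keeping the first occurrence of each value (an analogy only; Spark's distinct does not guarantee any particular order in its output):

```python
# Plain-Python analogue of RDD.distinct: keep each unique element once.
data = [5, 5, 4, 3, 2, 9, 2]

seen = []
for x in data:
    if x not in seen:   # first occurrence is kept, later duplicates are dropped
        seen.append(x)

print(seen)  # [5, 4, 3, 2, 9]
```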

But there are some transformations left which we have to apply on key-value pairs, so I will explain key-value RDDs in my next video: reduceByKey, groupByKey, join, and some more that apply to key-value pairs.

So that's all for this video. If you have any doubt, you can comment below and I will try to solve it. I hope you enjoyed the video; if it was informative, please like it and subscribe to my channel if you have not already. Don't forget to press the bell icon to get notifications of further videos in your inbox. See you in the next video; till then, goodbye, enjoy, happy life!