Transcript for:
PySpark Actions and RDDs

Hey, hello, welcome back to my YouTube channel. It's Ranjan, and this is the Apache PySpark playlist. I have already covered four videos; if you have not watched them, do have a look, because they will give you better clarity about PySpark. I have covered a basic introduction to PySpark, the difference between big data and PySpark, what MapReduce is, and the difference between pandas and PySpark, and in the fourth video I showed you how to install PySpark in a local environment. In this video we will talk about actions. I have already explained what an RDD is in PySpark; if you are not aware, just watch the third video of the Apache PySpark playlist.

First of all, let me explain what a SparkContext is. SparkContext is the main entry point for using PySpark features; it is a gateway through which we use Spark functionality in our Python environment, and it creates a connection to the Spark cluster. So to start with PySpark we have to create a SparkContext when working with RDDs. When we work with DataFrames and Spark SQL we create a SparkSession instead, but for RDDs we create a SparkContext. It is simply a connection to the Spark cluster through which we use the cluster's functionality in our Python code. Only one SparkContext may be active per JVM (Java Virtual Machine), so in order to create a new one we first have to stop the SparkContext we have already created.

First I will import two things, SparkContext and SparkConf, the configuration class; both exist in PySpark. There are three ways to create a SparkContext. The first way is to define some configuration: I create a conf variable using SparkConf, setting my app name to "youtube demo" and the master to local, and then I create my SparkContext. Here I am naming the variable sc, but I can give it any name, and I pass conf=conf so the configuration is assigned to the context. Now I can check which configurations I have set for this SparkContext: the Spark master is local, the driver host shows my system details, and the app name is "youtube demo". If I want to stop it, I can use sc.stop(), and after that the SparkContext is stopped. The second way is to create a SparkContext on the fly without defining any configuration; it picks up the default configuration, and then I stop it again. The third way is to pass the master and the app name directly, which is the easiest one. With that initialized, let's talk about RDDs. I have already explained a lot about RDDs.
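As a rough sketch of what is described above, here are the three ways of creating a SparkContext; the app name "youtube demo" and the variable names are illustrative, and the exact configuration may differ from the one used in the video.

```python
from pyspark import SparkContext, SparkConf

# Way 1: with an explicit configuration
conf = SparkConf().setAppName("youtube demo").setMaster("local")
sc = SparkContext(conf=conf)
print(sc.getConf().getAll())   # inspect the active configuration
sc.stop()                      # only one SparkContext may be active per JVM

# Way 2: default configuration, created on the fly
# (this assumes spark.master is already set in your environment, e.g. spark-defaults)
sc = SparkContext()
sc.stop()

# Way 3: pass master and app name directly
sc = SparkContext(master="local", appName="youtube demo")
```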
An RDD is an immutable, distributed collection of objects, and it can be divided into multiple partitions so that accessing the data is faster. We have many ways to create an RDD: from HDFS storage, from local storage, from a CSV file, from any storage really. An RDD is much like a dataset.

Once a dataset is ready, we apply machine learning to it, and before applying any machine learning model we need to perform some tasks: we understand the data, pre-process it, filter it, impute null or missing values, calculate features, and do feature selection. All of these tasks are known as operations. An operation can be anything, like sorting, filtering, summarizing data, or building a graph. In Spark, and therefore in PySpark, operations are divided into two kinds: transformations and actions.

In PySpark, transformations are lazily evaluated. Whenever you apply a transformation to an RDD it produces a new RDD, but it is not computed at that moment; it is computed only when you apply an action to that RDD. This is a very important interview question: what is the difference between a transformation and an action? A transformation helps us create a new RDD, so whenever we apply a transformation the output is always an RDD. But whenever we apply an action, it gives us a value, and that value could be an integer, a string, or anything else. So when we apply a transformation to an RDD, the operation is not performed immediately; Spark builds a DAG, a directed acyclic graph, of the applied operations, and it keeps building this graph of references until you apply an action. When you apply an action, it executes all the transformations you have applied to the RDD. That is why transformations in PySpark are called lazy in nature. The picture makes this clear: whenever I apply a transformation to an RDD it creates a new RDD, but when I apply an action to an RDD it gives me a value.

So operations on an RDD are divided into two kinds, as we have already seen: actions and transformations. We have some examples of actions, and in this particular video we will talk about actions and see how they perform in PySpark. Transformations themselves are divided into two kinds, narrow transformations and wide transformations. I will explain transformations in my next video; for now, just know that transformations are divided into these two kinds.
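To make the lazy-evaluation point concrete, here is a minimal sketch, assuming an active SparkContext named sc; the map transformation and the sample data are just illustrative, not the exact example from the video.

```python
# Transformation vs. action: nothing is computed until an action runs.
rdd = sc.parallelize([1, 2, 3, 4, 5])

doubled = rdd.map(lambda x: x * 2)   # transformation: returns a new RDD, nothing executes yet
print(type(doubled))                 # still an RDD; only the DAG has been built

result = doubled.collect()           # action: triggers execution, returns a Python list
print(result)                        # [2, 4, 6, 8, 10]
```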
You also need to be aware of examples of actions and transformations, because this has been asked in many interviews.

First I will create an RDD; we have many ways to create one, and here I am defining some names manually. names is my variable, and I am using sc, the SparkContext I created, with parallelize so that my data is distributed among the cluster, across the nodes and cores. These are the names I am putting into the RDD. If I check the type of names, it shows that this is an RDD, so my RDD has been created. Now if I want to see the names I need to call collect. names.collect() is a function, an operation, and in fact the first operation we see that is an action. Because it is an action, it gives me a value right away. You will see in my next video that when I apply a transformation nothing pops up, but in the case of an action it gives me the result immediately. The result is a list; if I store it into a variable a and check the type, it shows a list.

The second operation, or second action, is countByValue. If I want to see how many times Adam appears in this RDD, I use names.countByValue(). It gives me a count for every element: Adam appears once, Mark appears three times, Bill appears once. This is an action.

Next we have the foreach action. It is a unique operation because it takes each element and applies a function to it, but it does not return any value, so it is useful for creating logs. You will see that I define a function here, create an RDD, and apply foreach with a lambda. The lambda function in PySpark is exactly the same as the one we use in Python, because it is just Python, so you should be familiar with lambda functions if you are learning PySpark. It should print, but nothing comes back to the driver; if I check the type of the result it is NoneType, but it is still useful for creating logs.

You have seen that I used names.collect(), but we do not use collect in a production environment, because collect brings back every record from the whole dataset across the whole cluster. If we are using PySpark we should not use collect, because all the data would come into RAM, our system could collapse, and there would be no point in using PySpark at all. In that case we use take: if I want to see the first five values, I use .take(5).

That was the first way to create an RDD, by defining the data manually. Now I am using a text file stored in my directory: I create a new variable, employees, and use sc with the textFile function, passing the name of the file.
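A minimal sketch of the actions walked through above, assuming an active SparkContext named sc; the list of names, the resulting counts, and the file name employees.txt are illustrative.

```python
# parallelize distributes the local list across the cluster's partitions.
names = sc.parallelize(["Adam", "Mark", "Mark", "Mark", "Bill"])

print(names.collect())        # action: returns every element to the driver
print(names.countByValue())   # action: counts per value, e.g. {'Adam': 1, 'Mark': 3, 'Bill': 1}

# foreach applies a function to each element on the workers and returns None,
# so it is mainly useful for side effects such as logging.
names.foreach(lambda x: print(x))

print(names.take(2))          # safer than collect(): only the first 2 elements

employees = sc.textFile("employees.txt")   # create an RDD from a text file
```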
Now if I check the type of employees, it shows that an RDD has been created. If I want to see the contents of the employees RDD, I use employees.collect(), which gives me everything. I can also look at the first record, get the total count, and take the first five values, and it shows them like this; if I ask for the first 19 it shows me all the values. Again, we should never rely on collect, because it is computationally very expensive. And if I want the distinct count, that is, the number of unique values in my data, it shows me that 11 names are unique.

Now I am taking another example with numbers so that we get better clarity. Here I am creating a new RDD from a list of numbers. If I check how many times each value appears, it shows that 5 appears twice, 4 appears once, 3 appears once, 2 appears twice, and 9 appears once, just the same as what we did with the names. If I check the type, it shows an RDD.

Now I have a very important function called glom. What does it do? In general, Spark does not allow a worker to refer to a specific partition of an RDD; the information is stored in different partitions of each RDD, but Spark does not let the worker address a specific partition. glom transforms each partition into a list of its elements: it creates an RDD that stores one list per partition, so that a worker can refer to the elements of a partition by index. We use the glom function to turn each partition into its own list, and I will show you how.

If I apply glom and then collect, you will see the difference: without glom there is a single level of brackets, but with glom the brackets are nested. To see what other parameters parallelize accepts, I press Shift+Tab, and it says I can pass numSlices, which is how many partitions you want to create. Suppose I set it to one; the result is the same as when I gave no value, which means that by default it takes one here. But suppose I set it to two. The plain collect looks the same as before, but when I apply glom you will see the difference: it has created two partitions, so glom gives me two lists, because I asked for two. If I set it to three, the plain result is the same, but the data has been divided into three lists stored on three partitions. If I want the first partition I can index it with 0, and I can slice as well. With take I cannot do that kind of thing: take gives me everything up to that point, so take(4) returns all elements up to the fourth one; if I want just the elements of one partition, that is why I use glom. So glom is about partitioning: we use it to turn each partition into its own list. If I use six here, it creates six partitions, so glom gives me six lists: 1, 2, 3, 4, 5, 6. And it handles the edge cases as well.
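Here is a sketch of glom and the numSlices parameter, assuming an active SparkContext sc; the list of numbers matches the counts mentioned above, but exactly how the elements split across partitions may differ on your setup.

```python
# numSlices controls how many partitions the data is split into.
nums = sc.parallelize([5, 4, 3, 2, 5, 2, 9], numSlices=3)

print(nums.collect())            # flat list: [5, 4, 3, 2, 5, 2, 9]
print(nums.glom().collect())     # one sub-list per partition, e.g. [[5, 4], [3, 2], [5, 2, 9]]
print(nums.glom().collect()[0])  # elements of the first partition only
```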
If I specify 9 here but I have only seven values, it still creates 9 partitions, and you will see that two of the lists are blank because there are only seven values. If I check the type of nums.glom(), you will see it is a PipelinedRDD; the original was an RDD, but this one is a PipelinedRDD.

Now we have some simple actions: if I want the maximum value I can use num.max(), num.min() gives me the minimum value, and num.mean() gives me the mean of the data.

Next is a very important term, reduce. reduce works exactly the same way as it does in Python: it combines the values with whatever associative and commutative operation you pass. If you pass addition, it adds the values one after another. Suppose I have the values 1, 2, 3, 4, 5. With reduce and addition it computes 1 + 2 = 3, then 3 + 3 = 6, then 6 + 4 = 10, then 10 + 5 = 15, so the result is 15. With multiplication it multiplies everything: 1 x 2 x 3 x 4 x 5 = 120. That is how reduce works, and reduce is an example of an action.

There are many interview questions asking for the difference between reduce and fold. reduce and fold are essentially the same, but fold additionally takes an initial value; I will show you how it behaves. This is my RDD, and on it I am using reduce with a lambda that takes two values: the first value goes into a and the second into b, and it adds them. So 5 + 5 = 10, then 10 goes into a, the next value 4 goes into b, and so on; it adds everything and produces a sum of 30. For multiplication I apply multiplication instead. It is not necessary to use a lambda here; a lambda is just a one-line function, which is why I am using it, but I can pass any function. The multiplication of all the data comes out to 10800.

Now I am checking which is the maximum number using reduce. It is a simple function: it takes two values, first comparing 5 and 5; if x is greater than y it returns x, otherwise y, and it keeps comparing every value, so it shows me 9. I am also showing that with reduce I can use a user-defined function rather than a lambda. Here is a user-defined function: I give it a name, it takes two values a and b, and it returns a times 2 plus b times 2. I define that function, pass it to reduce, and it gives me a value.

Now I want to see the first three elements of this RDD. For arbitrary elements I use take, but for ordered elements I use takeOrdered: it first sorts the RDD and then gives me the result, so I get the three smallest elements, which are 2, 2 and 3.

Now about fold. We have seen how reduce works, and trust me, if you try to search the internet for how fold works you will struggle to find a clear explanation, so I will show you. This is my RDD: 5, 5, 4, 3, 2, 9, 2. Using reduce I was getting 30 when adding and 10800 when multiplying. Now here is the fold case: suppose I pass 0 as the initial value.
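A sketch of reduce, a user-defined reduce function, and takeOrdered over the same seven numbers, assuming an active SparkContext sc; the printed results follow the arithmetic described above.

```python
nums = sc.parallelize([5, 5, 4, 3, 2, 9, 2])

print(nums.max(), nums.min(), nums.mean())          # 9, 2, ~4.29
print(nums.reduce(lambda a, b: a + b))              # 30
print(nums.reduce(lambda a, b: a * b))              # 10800
print(nums.reduce(lambda x, y: x if x > y else y))  # 9, the maximum via reduce

def my_func(a, b):
    # any two-argument function works, not just a lambda
    # (note: this one is not associative, so the result can depend on partitioning)
    return a * 2 + b * 2

print(nums.reduce(my_func))
print(nums.takeOrdered(3))                          # smallest three elements: [2, 2, 3]
```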
With an initial value of 0 the result of the addition is exactly the same as with reduce, and similarly an initial value of 1 would leave the multiplication unchanged. But if I change the initial value to 1 in the addition, it adds 2 to the total of 30, so it becomes 32. With 2 it adds 4, and with 3 it adds 6. Why is that happening? When we created this RDD it took only one partition by default here, which I can confirm with glom: there are two brackets, one outer and one inner. With one partition, the initial value is applied twice, once for the partition and once more when the partition results are merged, so folding with 1 adds 2.

You will get better clarity when I show it with 2 partitions. Suppose I ask this RDD to create 2 partitions; now it has 2 partitions. If I fold with 1, it adds 3 to the total: I added 1, and 3 has been added to 30, so it becomes 33. You can see there is one partition here, another partition there, and one more application for the combined result, so it adds 1 to this partition, 1 to that partition, and 1 collectively, which is why it adds 3 when I use 1. If I use 2, it adds 6. Let me show it in Paint: this is my RDD, 5, 5, 4, and so on, and with 2 partitions, folding with 1 means adding 1 to this partition's result, 1 to that partition's result, and 1 more to the final combination, so 3 is added to the final calculation.

The same goes for multiplication. Suppose I use 1: the result does not change, because multiplying by 1 leaves everything the same. But suppose I multiply with an initial value of 2, and I have only one partition by default (two brackets in glom). This was my RDD: it multiplies the partition result by 2 and the merged result by 2, so the result from reduce gets multiplied by 4, and 10800 becomes 43200. That is why I am seeing 43200. That is fold: it depends on the number you pass as the initial value. If I use 2 partitions, the result changes again, even for multiplication, because the initial value is applied multiple times: now the total gets multiplied by 8, and 10800 multiplied by 8 is 86400.

And although I am using a lambda in every example, it is not compulsory; I can use any function. I can use the default add function that exists in the operator module, and the result is the same; I can even use multiplication from there.

There is also another way to create an RDD: I can give a range. Here I am using a new variable b and sc.parallelize with range(1, 10), so it takes the elements 1 to 9 and gives me an RDD. I will show you; this is another way to create an RDD.

So in this video we have covered actions, and we have seen the difference between actions and transformations. In the next video we will go through transformations and the different types of transformations. Transformations are divided into two types, narrow and wide, so we will cover those and see an example of each. That is all for this video, and I hope you enjoyed it. Please like this video and subscribe to my channel. If you have not subscribed, don't forget to press the bell icon to get notifications of further videos in your inbox.
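Here is a sketch of fold and of using functions from the operator module instead of a lambda, again assuming an active SparkContext sc; the zero values and partition counts mirror the walkthrough above.

```python
from operator import add, mul

nums = sc.parallelize([5, 5, 4, 3, 2, 9, 2], numSlices=1)

print(nums.fold(0, add))   # 30, same as reduce when the zero value is 0
print(nums.fold(1, add))   # 32: the zero value is applied once per partition plus once when merging
print(nums.fold(2, mul))   # 43200: 10800 * 2 * 2 with a single partition

nums2 = sc.parallelize([5, 5, 4, 3, 2, 9, 2], numSlices=2)
print(nums2.fold(1, add))  # 33: 1 is applied for each of the 2 partitions plus the final merge
print(nums2.fold(2, mul))  # 86400: 10800 * 2 * 2 * 2

b = sc.parallelize(range(1, 10))   # another way to create an RDD, from a range
print(b.collect())                 # [1, 2, ..., 9]
```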
So see you all in the next video. Till then, goodbye and enjoy. Happy learning!