Hey, hello, welcome back to my YouTube channel. It's Ranjan, and this is the third video of the Apache Spark playlist series. In my previous two videos I have covered the basics of big data, Hadoop MapReduce, and Apache Spark, and I have covered the drawbacks and problems in Hadoop MapReduce due to which we are using Apache Spark so extensively, as well as the difference between Apache Spark and MapReduce. If you have not watched those two videos, please go through them; the link is here. So let's start this video without wasting any further time. Now we will discuss what PySpark is. It is simply a combination of Python and Spark, hence the name PySpark. You have already seen the benefits and features that we get with Spark. So the Apache Spark community released a tool so that Python developers can use Spark.
They named it PySpark. So the Apache Spark community released an interface that allows Python programmers to use the functionality of Apache Spark and leverage big data services. This all became possible due to a library called Py4J, which we have already discussed: it bridges Python and Java, so calls from our Python code can be passed to the JVM and back. To use PySpark we have to install Python and Apache Spark on our machine; in the next video I will show you how to install Python and Apache Spark. The RDD, the resilient distributed dataset, is the main component in Apache Spark, and through PySpark we can work with RDDs from our Python programming language.
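To give a quick picture of what this looks like, here is a minimal sketch of starting PySpark and creating an RDD. This assumes PySpark is already installed locally (for example via pip install pyspark); the app name and the sample list are just placeholders.

    from pyspark.sql import SparkSession

    # Start a local Spark session; under the hood Py4J bridges Python and the JVM
    spark = SparkSession.builder \
        .appName("pyspark-demo") \
        .master("local[*]") \
        .getOrCreate()
    sc = spark.sparkContext

    # Build an RDD from a small Python list and bring it back to the driver
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(rdd.collect())   # [1, 2, 3, 4, 5]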
So now comes the difference between PySpark and Pandas. Up to this point you must have been using Pandas to store, process, and manipulate data frames. But you would have been working on data sets whose size is far less than the size of your RAM. When the data set grows beyond the size of your RAM, we have to use PySpark. I will tell you how and why Spark is different from Pandas. I am showing this difference just to provide clarity, so that you are aware of why we are using Spark.
It will give you a better picture so you are able to visualize things. First, as we have already seen, in Spark we have different nodes. We have many nodes, those nodes work as a cluster, and parallelization takes place in PySpark. Due to that, we can apply operations in parallel. But Pandas is a single-machine tool and it does not have any parallelism built in,
which means it will use only one core of your CPU. Suppose you have many cores in your CPU; in that case PySpark can distribute the workload to all the cores, but Pandas will just use one core. Also, operations in PySpark are lazy in nature; they use lazy evaluation, which I have already covered in my last video. In Pandas that is not the case: we get the result as soon as we apply any operation, even in the case of a read. On a read, all the data goes into your RAM, the main memory.
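Just to illustrate what lazy evaluation looks like in practice, here is a small sketch, reusing the sc from the earlier snippet with some placeholder numbers: the map and filter calls only record a plan, and nothing runs until the action at the end.

    # Using the sc created in the earlier sketch
    rdd = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations: Spark only records the plan, nothing is computed yet
    squares = rdd.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    # Action: only now does Spark actually run the job
    print(evens.collect())   # [4, 16]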
And this is the main point in PySpark: we have the RDD, which is roughly equivalent to the data frame that we have in Pandas. In PySpark the RDD, which holds our data set, is immutable. We cannot change the data; we cannot alter the data which is inside it. We can transform it, that is, we can create a new RDD from this RDD, but we cannot change the existing RDD, and this property is very useful in PySpark.
This is not a drawback; it is just a property, and this property creates a safe platform for sharing the data across processes and among different nodes, because if it is immutable we don't have the fear that it will get changed or distorted. It rules out a big set of potential problems due to updates from multiple threads at once. A Pandas data frame, however, is not immutable; we can change it at any time.
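As a quick illustration of this immutability, again reusing the sc from the earlier snippet: every transformation returns a brand-new RDD, and the original one stays exactly as it was.

    # Using the sc created in the earlier sketch
    rdd = sc.parallelize([1, 2, 3, 4, 5])

    # map() does not modify rdd in place; it returns a brand-new RDD
    doubled = rdd.map(lambda x: x * 2)

    print(rdd.collect())      # [1, 2, 3, 4, 5]  -- the original is unchanged
    print(doubled.collect())  # [2, 4, 6, 8, 10]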
Next, the PySpark API supports fewer types of operations as compared to Pandas. It is still being developed and has come quite far, but Pandas supports more operations than the PySpark data frame, so we can say the Pandas API as of now is more powerful, while Spark is faster. Also, complex operations are not as easy to perform in PySpark; they are somewhat complicated as compared to Pandas. And access to a PySpark data frame as such is slower, but processing is fast. We have already seen that processing is fast; access is slower because the data is stored across multiple nodes, and that's why access is slower.
But in Pandas, access is faster, because when we read a CSV the data is already stored in RAM, while processing is slower. A Pandas data frame is faster to access because it uses local memory, the primary memory, and access to primary memory is always faster; but it is limited to the available memory. With PySpark we can scale: we can
create new nodes in the cluster; it is horizontally scalable. In PySpark we have to create a cluster, but in Pandas we don't need to create any cluster; we can use Pandas directly. And PySpark is a little bit complicated as compared to Pandas, because Pandas is simpler and more flexible, it has more functions, and it is easier to implement. So these are all the differences between PySpark and Pandas.
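To tie the comparison together, here is a rough side-by-side sketch of reading the same file both ways; the file name data.csv is just a placeholder, and spark is the SparkSession from the earlier snippet. Pandas pulls the whole file into local RAM immediately, while Spark keeps the data distributed and evaluates it lazily.

    import pandas as pd

    # Pandas: the whole CSV is pulled into local RAM straight away
    pdf = pd.read_csv("data.csv")
    print(pdf.shape)

    # PySpark: the data frame stays distributed and is evaluated lazily
    # (spark is the SparkSession from the earlier sketch)
    sdf = spark.read.csv("data.csv", header=True)
    print(sdf.count())   # an action, so this is where the work actually happens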
Now comes the resilient distributed dataset, the RDD. It is the fundamental data structure unit of Apache Spark; it is the main unit due to which Spark became so famous. What does resilient distributed dataset mean? Resilient means it is fault tolerant and capable of rebuilding data on failure. That means when a machine is performing some operation on the data, you don't have to fear that your data will get distorted, corrupted, or destroyed: Spark keeps the information it needs to recreate the data, so even if something goes wrong while you are performing an operation, the original data set is not lost. That is what resilient means. Second is distributed: we have already seen that the data is distributed among the nodes of the cluster. And third is dataset: a data set is like a collection of data; you can say a data frame, or you can say a file, any file.
So it is just a collection of data with some values, organized like a table. Second, an RDD is immutable and follows lazy transformations. We all know what lazy evaluation is; it was covered in the previous video. Immutable means we cannot edit the RDD, and this property is very useful. I will show you how an RDD looks in the next slide. An RDD runs and operates on multiple nodes, due to which we can apply parallel processing on different nodes, and a Spark RDD can be cached and manually partitioned. I will show you how we create partitions in the RDD.

There are two main kinds of operations that we can apply on an RDD. The first is a transformation and the second is an action. A transformation helps to create a new RDD by applying some function; there are many functions, such as filter, groupBy, and map. So transformations are those operations in which we work on the input data set, apply some function, and get a new RDD. An action, on the other hand, is a set of instructions we give Spark to perform some computation, and whatever result we get from that computation is sent back to the driver. The driver is like the master node; it is the main program which runs all the processes. I will show you this in the next slide.

Now, this is the structure of an RDD, how it looks. I can create many RDDs in my Spark cluster, but let me take the example of one RDD, RDD1. Spark will create a number of partitions for it; how many depends on the number we give and on how big the data is. Suppose I have created four partitions, one, two, three, and four, and I have four nodes in my cluster (or I have many nodes, but Spark is using only four of them). These partitions get stored on those four nodes, each of which has some memory. So every data set gets stored into an RDD: suppose my data set is stored in this RDD, and it is divided into multiple logical partitions, and this distribution of partitions is done by Spark itself, so the user doesn't have to worry about computing the right distribution. This is my driver node, which we can take as the master node, and these are my worker nodes, W1, W2, W3, and W4. So this is my one RDD, RDD1, and you can see it is stored across four machines: this is partition 1, this is partition 2, this is partition 3, this is partition 4. That is one data set; I can store another data set the same way as RDD2, and another with the same structure as RDD3, each with partition 1, partition 2, partition 3, partition 4. So first my data set gets converted into an RDD, then this RDD is divided into multiple partitions, and those partitions are stored on different nodes of the cluster.
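Here is a small sketch of partitions, transformations, and actions together, reusing the sc from the earlier snippet; the data and the choice of four partitions are just illustrative.

    # Using the sc created in the earlier sketch; ask Spark for 4 partitions
    rdd = sc.parallelize(range(1, 101), numSlices=4)
    print(rdd.getNumPartitions())   # 4

    # Transformations: each one builds a new RDD, nothing runs yet
    evens = rdd.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Actions: the computation runs and the result comes back to the driver
    print(squares.count())   # 50
    print(squares.take(3))   # [4, 16, 36]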
Now we come to the important features of the RDD, due to which we are using PySpark extensively. The first feature is in-memory computation, which I have already explained, and the second is lazy evaluation, which has already been covered in the previous video. Third is fault tolerance: Spark RDDs are fault tolerant because they track data lineage information. There is a graph called the data lineage graph, and it stores all the metadata of the data set, for example which pieces of information and which partitions are stored on which node of the cluster.
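You can actually peek at this lineage from PySpark: every RDD can print the chain of transformations it was built from via toDebugString(). A quick sketch, again reusing the sc from the earlier snippet; the exact format of the output varies between Spark versions.

    # Using the sc created in the earlier sketch
    rdd = sc.parallelize(range(10))
    doubled = rdd.map(lambda x: x * 2)
    filtered = doubled.filter(lambda x: x > 5)

    # toDebugString() shows the lineage: which parent RDDs this one was built from
    print(filtered.toDebugString().decode("utf-8"))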
This lineage helps to rebuild data that gets lost between operations, and each RDD remembers how it was created: since an RDD is created by applying transformations, it knows which transformation produced it, whether map, join, groupBy, or one of the many other operations.

The fourth feature is immutability. We cannot change an RDD in between operations; we can only transform it, that is, create a new RDD, and due to this property it is safe to share it across processes, because the data will be shared among different nodes.

Fifth is partitioning. It is the basic unit of parallelism in a Spark RDD, because PySpark can perform operations in parallel only once the data is partitioned and distributed among different nodes. Each partition is one logical division of the data, representing a subset of it, and we can create partitions through transformations.

Sixth is persistence. We can define a storage level with which an RDD will be reused. Suppose I have to use an RDD many times: I will set persistence for it, saying this RDD should be kept in RAM for some amount of time, or I can define whether this RDD should be stored in RAM or on the hard disk. If I am using an RDD very frequently, I will say it should be stored in RAM, in main memory; but if I am using some RDD much less often, I can define that it should be persisted to disk instead.

Seventh is coarse-grained operations. These are broad operations that apply to all elements in the data set, which is why they are called coarse grained, and there are many of them: map, filter, groupBy, flatMap, sample, union, join. I will show them in further slides.

And eighth is location stickiness. RDDs have a feature by which we can define placement preferences for computing a partition. By placement preference we are referring to information about the location of an RDD, that is, on which node that partition should live. There is another graph, the DAG scheduler graph, which helps to place the partitions in such a way that whatever task we have to perform is as close to the data as possible, so there isn't any latency and it speeds up the computation. So there are two graphs: first the DAG, and second the data lineage graph, which stores the metadata information, that is, where each partition is stored; and the DAG scheduler does many other things besides this. I will explain the DAG in my further videos.

In my next video I will show you how to set up and install Apache Spark and Hadoop, and then how to use PySpark in a Jupyter notebook or in PyCharm. So that's all in this particular video, and I hope you enjoyed it. If you like the content, please like this video and share it with your colleagues and friends. Please subscribe to my channel if you have not subscribed, and don't forget to press the bell icon to get notifications of my further videos in your inbox. See you all in the next video; till then, goodbye.
Enjoy. Happy learning