Transcript for:
Transformations and Actions in Spark

Hi, welcome to a2zetknowledge.com. Today we are going to discuss transformations and actions in Spark. I have already explained what Spark is and what an RDD is, and all those videos are available in my Spark playlist; I have shared the playlist link in the description box of this video, so you can go through those first. Now we are going to dive deeper into Spark transformations and actions.

With respect to an RDD, there are two major kinds of operations, which we call transformations and actions. Whatever we do with the data is a transformation: if you filter something, do a join, aggregate, or map something, all of that we call a transformation. And what is an action? After performing some transformations you finally have to do something with the result, for example print it or save it to a file; operations like that we call actions.

If you are very new to Spark, you may have the question: just by looking at the code, how can I say this particular function or piece of code is a transformation and that one is an action? That is something you will easily understand once you start doing your coding activities. To give some examples of transformations: map is a transformation, filter is a transformation, and so on. For actions we have something like collect, which brings the data back, similar to a print statement; there is show; and there is saveAsTextFile, which saves your output to a file. By seeing these, we can say, yes, these are actions.

Spark RDDs also have an advantage called lazy evaluation. What does that mean? Many people are already familiar with this; if you are, you can skip about 40 to 50 seconds of this video from now. For the others, here is what lazy evaluation is. In MapReduce there are only two operations, map and reduce, that's it, but in Spark we have a lot of transformations. Imagine you are writing a Spark program: you read a file, then you apply a map operation on top of it to make it a key-value pair or whatever you want, then you apply a filter, and finally you perform an action like saveAsTextFile to save the result in HDFS or somewhere on your Linux machine. Now, if you run just the first three steps in your spark-shell, or even in an IDE, Spark will not do anything. Only when there is an action will it start executing your transformations, so in a sense it works from the bottom (the action) back to the top, and that is what we call lazy evaluation.

You can ask me why it has been designed that way. Think of it like this: I am using some memory, say 20 MB to read the file, 50 MB for the map, and 100 MB for the filter, and in the end you never do anything with the result. The compiler is effectively asking you: why should I do all these transformations and waste all these resources when you never perform an action? So Spark has been designed in what we can call an intelligent way: it checks your code for an action, and if no action is mentioned, your Spark job will not start. Only when it finds the action will it start from your very first transformation.
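To make that concrete, here is a minimal sketch of the read, map, filter, save pipeline just described, written for spark-shell (Scala) and assuming its built-in SparkContext sc; the file paths are just placeholders, not from the video:

val lines = sc.textFile("hdfs:///data/input.txt")                 // transformation: nothing runs yet
val pairs = lines.map(line => (line, 1))                          // transformation: still nothing runs
val nonEmpty = pairs.filter { case (line, _) => line.nonEmpty }   // transformation: still lazy
nonEmpty.saveAsTextFile("hdfs:///data/output")                    // action: only now does Spark launch a job

Until that last saveAsTextFile line runs, the Spark UI shows no job at all; the first three lines only build a plan.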
Okay, fine. Now, within transformations we again have two types: one is narrow transformation and the other is wide transformation. And again the same question may come to your mind: by seeing a transformation, how can I tell whether it is narrow or wide? That is the next question for you. For this I am going to explain with an example, and we have a hands-on demo as well. Let me clear this.

Imagine you have four tasks. The first thing you do is read a file, so read is my first transformation; you read a file and you get its output. On top of that output you apply a map operation, and that gives the map output. Then you apply a filter operation, and that gives the filter output. If you see here, all of these are one-to-one: read, map, and filter each take one input partition and produce one output partition, so we call them one-to-one, and a one-to-one transformation is what we call a narrow transformation.

Before I explain wide transformations, take MapReduce: say you have three mappers and two reducers in the MapReduce world. A mapper may or may not send its output to both reducers, depending on the condition. The second mapper again sends part of its output to the first reducer and part to the second, and the third mapper, again, may or may not send its output to both reducers, based on whatever logic or condition you have written; each mapper is potentially sharing data with both reducers. What you see here is shuffling: what happens between map and reduce is the shuffle, and the shuffle is the core part, the heart of MapReduce. Imagine the same thing happening in Spark.

Now, how do you increase the number of reduce tasks? In MapReduce code we can set the count to 2, 3, and so on; we can increase or decrease it. Similarly, in Spark you can increase or decrease the number of tasks in the final aggregation part, and we do that via repartition: there is a function called repartition with which you can increase the number of output tasks, or you can decrease it, and for decreasing we usually use coalesce. I will explain coalesce and repartition properly in my upcoming videos, but the small sketch below gives a rough idea.
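As a quick preview only (the topic is covered properly in a later video), here is a minimal sketch in spark-shell (Scala); the RDD contents and partition counts are made up for illustration:

val counts = sc.parallelize(Seq(("hi", 3), ("hello", 2), ("w", 2)), 4)   // a small example RDD with 4 partitions
val fewer = counts.coalesce(2)       // decrease to 2 partitions; avoids a full shuffle
val more = counts.repartition(8)     // change to 8 partitions; always does a full shuffle
println(s"${fewer.getNumPartitions} ${more.getNumPartitions}")           // prints: 2 8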
Now imagine I just want to run my aggregation part in two tasks, not four or five; again, that is up to the coder's decision. In MapReduce we would call this final part the reducer, but here I cannot use that word, because in Spark we have so many other transformations, not only reduce, so I am trying to avoid that word while explaining the same idea in Spark. So I have two output aggregation tasks, and what I am going to do is a groupByKey: I just want to group all the outputs. Imagine I am doing a word count. In the first task I have "hi" and "hello", in the second I have "hi" and "w", in the third I have "hi" and "hello", and in the fourth I have "hi" and "w" again.

Now I need all the words "hi" and "w" to go to my first task of the groupByKey, or in MapReduce terms, I need the counts of "hi" and "w" in my first reducer and "hello" in my second reducer. That means the first task will send its "hi" to the first aggregation task; the second task will send "hi" and "w", because it has both and our condition is that "hi" and "w" go there; the third task has only "hi", so it sends only "hi"; and the fourth has "hi" and "w", so it sends both. Similarly, the first task sends "hello" to the second aggregation task because it has it, the second task does not have "hello" so it sends nothing, the third sends its "hello", and the fourth again sends nothing because it has no "hello". So what happens here is a shuffle: before the data reaches the aggregation task, all the task outputs containing "hi" are grouped and sent, along with all the "w"s, to my first aggregation task. That is shuffling; in MapReduce we used to say group by key and sort by key, and exactly the same thing happens in Spark. Shuffle means data movement is happening from one node to another node.

So this grouping is also a transformation, but it requires a shuffle, whereas the earlier three (read, map, filter) are also transformations but require no shuffle; they are just one-to-one. A transformation that does not need a shuffle we call a narrow transformation, and a transformation that needs a shuffle we call a wide transformation. That is the difference between narrow and wide. Now, just by seeing a list of transformations, can you say which is narrow and which is wide? It is easy; I will show you a list of transformations and actions. You can see here that all of these transformations are one-to-one, so they are narrow; all of these require shuffling, so we call them wide transformations; and all of these are actions.

Now I will explain with the word count program. I have already explained the word count program, but not in this context, so I am explaining it with this now. If you see here, I am reading a file, which is a narrow transformation; then I am doing a flatMap to do the split operation, which is again narrow; and then a map to make key-value pairs, which is again narrow. I will copy this and run it here. Okay, I have executed all the narrow transformations, and I already have the Jobs page of the Spark UI open. You can see one completed job; that is something I had already run, not the one we are running now. If you refresh here, nothing happens: we have executed a line, but we did not get any new job for it. That is the lazy evaluation I was explaining. Next I am going to trigger the reduceByKey, which is the aggregation, meaning a wide transformation. I execute this as well, and if I refresh the page again I still do not get any new job, because, as I told you, only when the compiler finds an action in the code will your job get started. Now I am going to invoke an action. Here the last RDD is b, so I run b.collect, and with that I am able to see the output; you can see it here, and one of the words has a count of 2.
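For reference, a minimal sketch of the kind of word count being run here in spark-shell (Scala); the file path and most variable names are placeholders, only the final RDD name b mirrors the demo:

val a = sc.textFile("words.txt")               // narrow: read the file
val splitWords = a.flatMap(_.split(" "))       // narrow: split each line into words
val pairs = splitWords.map(word => (word, 1))  // narrow: make (word, 1) pairs
val b = pairs.reduceByKey(_ + _)               // wide: shuffle so equal words meet in one task
b.collect().foreach(println)                   // action: only this line actually starts the Spark job

Everything up to the map lands in one stage; the shuffle introduced by reduceByKey starts a new one, which is exactly the stage split shown next in the Spark UI.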
Now if I refresh this page, I can see two jobs, and this one is the latest. So only when I trigger the action can I see the job getting started and then completed. Now if I click this, I can see the DAG. I have a separate video planned for the DAG, but the main idea is this: for each and every transformation you trigger, Spark maintains a data lineage; it records what transformation happened before, what is happening now, and what is going to happen next. So if, in between, some transformation fails or data loss happens, Spark can recompute from the previous transformation. It maintains all this lineage to provide fault tolerance, and for that it definitely needs the DAG. Again, I will make a separate video on the DAG; if it is not yet available please wait, I will upload it soon, but this is just a heads-up on what the DAG is.

If you see here, we have two stages, stage 2 and stage 3. In the first of them we are reading a file, then doing flatMap and map, and in the next stage we have reduceByKey. You can ask me why reduceByKey has moved to a different stage. Here is how the stages work: all the consecutive narrow transformations go into one stage, so if I have five narrow transformations, all five share a single stage. When the DAG finds a wide transformation, it cuts there and creates a new stage for it. So whenever a wide transformation follows narrow ones, the stages split, and that is the reason I can see two stages here. And why are they not numbered 0 and 1? Because one job was already created earlier with stages 0 and 1, so for this job the stage numbers are 2 and 3; don't get confused. I will click the first of these stages: you can see that in it we are reading a text file, then we did a flatMap and then a map. Now I go back and click the other stage, the one ending with collect. You can see it is shown as a ShuffledRDD; since this particular transformation is a shuffle operation, it created a new stage.

So in this video we discussed transformations and their types, the lazy evaluation part, and we also touched on the DAG. Thanks for watching a2zetknowledge.com. If you really liked this video, please subscribe to my channel and forward it to your friends and colleagues. If possible, it is my request, please share this video on your LinkedIn as well; I have given my LinkedIn profile and my Instagram ID in the description box. We make videos in two languages, English and Tamil, and the complete Spark playlist is available in the description box. And not only big data videos, we have other tech videos on my channel as well, please have a look. Thanks for watching a2zetknowledge.com.