Transcript for:
Spark Memory Management Overview

Do you know that understanding memory management can help you solve a wide variety of optimization problems in Spark? Memory management has been a crucial yet tricky topic for most of us to understand, but it is very important for developing a solid grasp over two things: first, how Spark works internally, and second, which portion of memory is responsible for storing what. I'm going to make all of this easy to understand in this video, so let's get started.

We are going to talk about three topics. The first, of course, is executor memory management. The second is unified memory: what it is and why it is even called "unified". The third is off-heap memory, the least talked about and often the least used memory, but one that can be very useful in certain situations.

Now let's dive into the executor memory layout. Here we have a Spark executor container, and we see three major components of memory: the on-heap memory, which occupies the largest block; the off-heap memory; and the overhead memory. An important point to note is that most of Spark's operations run on the on-heap memory, which is managed by the JVM. For those of you who don't know, the JVM stands for Java Virtual Machine, and it's like a virtual computer used for running Java programs. Spark is written in Scala, which runs on the JVM; the JVM serves as an execution environment not just for Java but also for other languages, including Scala. So when you write code in Python using PySpark, what you're doing is basically using a wrapper around Spark's Java APIs. This lets you use all of Spark's features, but the underlying execution
still happens on the JVM, and that is why on-heap memory is managed by the JVM.

The on-heap memory is divided into four sections. The first is execution memory: this is where your joins, shuffles, sorts, and aggregations happen. The second is storage memory: this is where your caching happens, whether RDD or DataFrame caching, and it is also used for storing broadcast variables. Together, these two portions are called the unified memory; we'll go into more detail on why it's called unified, but for now let's cover the other parts. The third is user memory, which is used for storing user objects: the variables or collections like lists, sets, and dictionaries that you define in your program, and also UDFs (user-defined functions). The last is the reserved memory that Spark needs for running itself and for storing internal objects. The overhead memory, to simplify, is used for internal system-level operations. And the last bit is the off-heap memory, which we are going to cover in a lot of detail later.

Okay, let's take an example to understand how much space each portion of memory occupies. Let me quickly clear all of this, and let's assume that when we ran our Spark job we specified the executor memory as 10 GB; that is, we set spark.
executor.memory to 10 GB. Now there is a value called spark.memory.fraction, which defines how much space the unified memory takes; by default it is 0.6 of the total, so 0.6 × 10 GB = 6 GB. To calculate how much of that goes to storage memory, there is another parameter, spark.memory.storageFraction, which can be adjusted based on your needs but defaults to 0.5. That simply means half of 6 GB, which is 3 GB, goes to storage, and the remaining 6 GB − 3 GB = 3 GB goes to execution. The rest of the heap is 10 GB − 6 GB = 4 GB. To find out how much of that is user memory, subtract the reserved memory: 4 GB − 300 MB; for simplicity let's assume 1 GB = 1,000 MB, so that's 3,700 MB, which is 3.7 GB. And the last one, as we already know, is the 300 MB of reserved memory.

So, just to reiterate: the whole heap was 10 GB; the unified memory ended up at 6 GB, of which storage occupies 3 GB and execution 3 GB; the remaining 10 GB − 6 GB = 4 GB splits into 4 GB − 300 MB = 3.7 GB of user memory plus 300 MB of reserved memory. That is how the calculations look for the on-heap memory. Now, when we define executor memory, this is completely separate from the off-heap memory and the overhead memory: no
portion of the memory we defined there gets allocated to those two regions. If we wanted to calculate the overhead memory, it is simply defined as the maximum of 384 MB and 10% of spark.executor.memory. The executor memory is 10 GB, 10% of which is 1 GB; the maximum of 384 MB and 1 GB is 1 GB, so your overhead is going to be 1 GB.

Coming to the off-heap memory: it is disabled by default, so you see that this parameter is set to zero. You can of course set it to a nonzero number and allocate memory there. The way off-heap memory is structured, it also has two parts, execution and storage, so it looks the same as the unified memory. Normally it's disabled, but if you want to use it, a good start can be something like 10 to 20% of your executor memory, so 10 to 20% of the 10 GB in our example, and then you can experiment from there.

Now, one possible question you may have: the executor memory is 10 GB, but that 10 GB is only allocated to the on-heap portion, so how do the other two portions get memory? A very important point to note is that executor memory allocates only for on-heap. When Spark requests memory from the cluster manager, such as YARN, it adds up the executor memory and the overhead (and the off-heap size too, if off-heap is enabled) and requests that total. In this case, assuming off-heap is disabled, overhead is 1 GB and executor memory is 10 GB, so Spark requests 11 GB of memory from the cluster manager for this container. I hope all of that makes sense; it has become a little messy with all these numbers over
here, but I hope this gives you clarity on how the numbers are divided between the portions of memory.

Okay, now we are going to talk about unified memory. As we've seen earlier, the execution and storage memory together are called the unified memory, and the reason is Spark's dynamic memory management strategy: if execution needs more memory, it can simply use some of storage's memory, and if storage needs more memory, it can use execution's. But priority is given to execution memory, because all of the important operations, like joins, shuffles, sorting, and group-bys, happen in execution memory. What I mean to say is that the slider you see between the two regions is movable: it can go up or down based on whichever portion needs memory.

I also wanted to talk a little bit about how things were before and after Spark 1.6. Before Spark 1.6, the space given to execution and storage memory was fixed; the bar was not movable. So if execution's memory was already full and it needed more, it could not use storage's space even if storage was completely empty, and that is a big waste of the memory and space that is available. From Spark 1.6 onward, the slider became movable, so it can move according to the needs of execution and storage. So let's look at a few rules that define how and where the slider can move. So let's
imagine that execution needs more memory. If there is vacant memory that storage is not using, execution can simply go ahead and utilize that portion; you can see that execution has now used some of the storage region.

Now, a second example: let's say storage also holds some blocks, and execution needs more memory. Execution is saying, "Hey, I need to do some more joins, shuffles, and aggregations; give me more memory." In that case, storage is going to take some of its blocks and evict them in order to make room for execution. This eviction happens using the LRU algorithm: the least recently used blocks are evicted first. The freed-up space then becomes available and is used by execution memory.

Now, the last case: storage has indeed occupied a lot of memory, and whatever program we are running needs to cache a few more objects, which means we need more storage memory. So storage is saying, "Hey, I need more memory, give me more space." But because execution has priority, and because critical operations are executed within execution memory, none of the execution blocks are
going to be evicted. It is storage that has to evict its own blocks, based on LRU, and the freed-up space is where storage then stores the newly cached data. This is simply because execution is given priority over storage.

What we've also seen is that a lot of us cache all the time without properly understanding whether a particular piece of code needs caching or not. For example, there is a DataFrame that is never going to be reused, but we still cache it; there's no point caching such a DataFrame. That also emphasizes why it is important to give more priority to execution memory: all critical operations happen in execution memory, while you can always cache things again in storage memory. So in this case, to reiterate, storage would have to evict its own blocks to make room, and the newly cached blocks would fit into the freed space.

Okay, before talking about the off-heap memory, I wanted to reiterate that the on-heap memory is where the majority of Spark's operations take place: all kinds of joins, aggregations, shuffles, sorting, and so on. The on-heap memory is managed by the JVM. Now, in the event that the on-heap memory is full, there is going to be a garbage collection (GC) cycle: it pauses the current operation of the program and cleans up all of the unwanted objects in order to make room for the program to resume. These pauses that the GC takes from time to time may take a toll on the performance of the program, and it is in this case that the off-heap memory may come in very useful.
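Before moving on, the borrowing and eviction rules described above can be sketched as a toy model. This is an illustration only: Spark's real UnifiedMemoryManager is considerably more involved, and the class and method names here are hypothetical, not Spark APIs.

```python
from collections import OrderedDict


class ToyUnifiedMemory:
    """Toy model of Spark's unified (execution + storage) pool. Sizes in MB."""

    def __init__(self, total_mb):
        self.total = total_mb
        self.execution_used = 0
        # OrderedDict keeps insertion order, which we treat as LRU order:
        # the first item is the least recently used cached block.
        self.storage_blocks = OrderedDict()

    @property
    def free(self):
        return self.total - self.execution_used - sum(self.storage_blocks.values())

    def acquire_execution(self, mb):
        # Execution may take free space AND may force storage to evict
        # its blocks (LRU first) to make room.
        while self.free < mb and self.storage_blocks:
            self.storage_blocks.popitem(last=False)  # evict least recently used
        if self.free >= mb:
            self.execution_used += mb
            return True
        return False  # even eviction didn't help; real Spark would spill

    def cache_block(self, name, mb):
        # Storage may evict only its OWN blocks; execution pages are
        # never evicted, because execution has priority.
        while self.free < mb and self.storage_blocks:
            self.storage_blocks.popitem(last=False)
        if self.free >= mb:
            self.storage_blocks[name] = mb
            return True
        return False
```

For example, with a 6,000 MB pool holding two 2,000 MB cached blocks, a 3,000 MB execution request evicts the oldest cached block (rule two), and a later cache request that doesn't fit evicts storage's own remaining block rather than touching execution's pages (rule three).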
Now, the off-heap memory is managed by the operating system, and therefore it is not subject to the garbage collection cycles on the executor. However, a very important point to note: because it is not subject to those GC cycles, it is you, the Spark developer, who has to take care of two things: the allocation and the deallocation of memory. This adds complexity to the overall process, but it is really important in order to avoid memory leaks; hence the off-heap memory is supposed to be used with a lot of caution, because there's additional responsibility on your shoulders. Another caveat is that the off-heap memory is slower than the on-heap memory; the on-heap memory is the closest to Spark. However, if Spark had two options, spilling to disk or using the off-heap memory, the better choice would be off-heap, because writing to disk would be several orders of magnitude slower. So I believe that gives you a bit of an overview of what the off-heap memory is and how it can be used. In order to use it, you first have to set spark.memory.offHeap.
enabled to true, and then you specify the size of the off-heap memory via spark.memory.offHeap.size; as I explained earlier, a good start may be 10 to 20% of your executor memory.

It's great to see that you've reached the end of the video. To summarize, we've covered executor memory management and the executor memory layout; we've talked about unified memory, what it is, why it is called unified, and the rules that decide the movement of the slider between storage and execution memory; and lastly, the off-heap memory. I hope all of that made sense, and if it did, please don't forget to subscribe to my YouTube channel, like the video, and share it. Thank you so much for watching!
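(Note: as a reference, the arithmetic walked through in the video can be recapped with a small helper. This is a sketch of the simplified model presented here, assuming 1 GB = 1,000 MB; real Spark subtracts the 300 MB reserved memory before applying spark.memory.fraction, so its exact numbers differ slightly. The function name is mine, not a Spark API.)

```python
def memory_layout_mb(executor_memory_mb, memory_fraction=0.6,
                     storage_fraction=0.5, reserved_mb=300):
    """Simplified executor memory layout, as presented in the video (MB)."""
    unified = executor_memory_mb * memory_fraction      # execution + storage
    storage = unified * storage_fraction                # cached RDDs/DataFrames
    execution = unified - storage                       # joins, shuffles, sorts
    user = executor_memory_mb - unified - reserved_mb   # user objects, UDFs
    overhead = max(384, 0.10 * executor_memory_mb)      # max(384 MB, 10% of heap)
    return {
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": execution,
        "user_mb": user,
        "reserved_mb": reserved_mb,
        "overhead_mb": overhead,
        # What Spark asks the cluster manager for (off-heap assumed disabled):
        "container_request_mb": executor_memory_mb + overhead,
    }


layout = memory_layout_mb(10_000)  # the 10 GB executor from the video
```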