Welcome back to my channel. I'm Bryan Cafferky, and I'm excited to continue my series on Databricks and Apache Spark performance tuning. This is lesson two, the low-hanging fruit video: the easiest way to get your workloads optimized, before you even start trying to do much, and that's by getting the resources correct right at the start. Before I jump in, I want to thank my Patreon supporters; I couldn't do this without you, so I appreciate all the support you give me, and for anyone else who's interested, I'll leave a link in the description.

Where are we going? We're going to talk about why compute resources matter, the Databricks workspace (Standard or Premium), Databricks and Apache Spark cluster architecture, and then the hardware underneath that architecture, which leads into how to optimize your cluster configuration step by step. I want to give special focus to shuffles and spills, because those are usually where you'll take the biggest performance hits, and finally we'll wrap it up.

Why are we talking about this, Bryan? Why do we care about resources, especially in the cloud? Well, when I started to think about this I was surprised at how little it's really covered; at least, I haven't seen a lot of good documentation about which options to select for our clusters and what effect each one has. I've seen it in bits and pieces, and I did find one pretty good page, which I've consolidated here, recommending the type of nodes to use depending on the workload. But the main reason I want to jump into this is that it's very important and not usually covered very much.

Now, historically (and I go way back), the age-old trade-off has always been cost versus performance. On the old legacy SQL Server databases you could always throw more hardware at the problem: get a bigger machine, add faster this and faster that, and generally that would get rid of the performance problem. But of course that's a trade-off: lots of money versus performance. With Databricks the idea is not so much to spend a ton of money as to get the right balance. As long as you get the right sizing and types of nodes, and the right number of nodes, you should get the best performance at the best cost. And I want to emphasize: this is the single easiest thing you can do to improve your performance. If you've got the wrong sizing and the wrong resources allocated to your workloads, then anything else you try is probably not the best use of your time, so get this right.

I also want to emphasize that there are interdependencies between performance optimization and the compute selections, resources, and options, and how they interact. What do I mean by that? For instance, if you create a cluster that doesn't support Unity Catalog, you won't be able to take advantage of the Unity Catalog features related to performance. Similarly, as we'll see when we get to creating your workspace, some of the things you'll need to do to improve performance require the Premium workspace rather than Standard.

So let's talk about that first. This is the screen you go to when you want to create an Azure Databricks workspace on Azure, and I want to call your attention to one particular point here: the pricing tier. Now, I don't know that there's a huge difference in pricing between Premium and Standard, and I think at this point the general recommendation is, except for maybe certain types of workloads (development or something), go with Premium.
Premium gives you role-based access control, which is really good for security, but it helps you with a lot of other things too. For instance, if you go with Premium you get to use Unity Catalog; it requires Premium. If you want to do Delta Live Tables, that also requires Premium. What are Delta Live Tables, Bryan? Glad you asked; I'll put a link in the description to my riveting Delta Live Tables video. Delta Live Tables is really a kind of automated way to create data pipelines, and the nice thing about it is that it does all kinds of useful things like tracking lineage and giving you insight into it, and it has some intelligence about what you're doing; in other words, it understands the various data assets you're building. As I mentioned, it's automated, so it takes care of a lot of the work for you. Now, whether it performs better or not I'm not sure (it probably depends on your workloads), but it does take care of a lot of the issues of maintaining and running workloads, so it's something you want to have available in many cases.

Fortunately (I had to research this, I wasn't sure), Photon does not require the Premium tier, and that's a good thing because Photon is just really good. I was trying to get bad performance with Photon turned on and I just couldn't: it was too smart, and I couldn't fool it into performing badly. I wanted to create a spill and it just wouldn't spill, and I didn't want to spend a fortune loading enough data to force one, so I'll show you a little trick I used. But good news: Photon does not require Premium. And I just want to call this out: I'll put a link to these slides in the description so you can get them, and therefore all of these links, for yourselves. This particular link is to the Azure Databricks documentation showing that you do need Premium if you want to use Unity Catalog, and the same goes for Delta Live Tables.

So now we're getting into the real meat of this discussion, which is cluster architecture. Before I do, I want to mention this great video by Daniel Tomes (I believe that's the right way to say it), which I'm borrowing from wherever I can. It's about four years old, but as far as I can see almost all of it, if not all of it, is still applicable. It's a great video, much deeper in the weeds, going into the Spark UI and finding ways to improve poorly performing workloads. We will get into that as well; I'm backing up a bit because I want to create a broader set of videos that cover, end to end, what to do to get well-performing workloads, part of which is fixing the ones that are not performing well, like Daniel talks about. But as I mentioned, the best place to start is with the right resources being allocated. I'll put the link to his video in the description; again, highly recommended. It's really the best video I've seen on how to optimize workloads, and he gives a really great no-nonsense explanation, which I appreciate.

So let's talk about the cluster architecture. I've used this diagram in my previous video, but I want to get into more of the details now. It all starts with, we'll say, Debbie here, who's going to query information against a phone book that sits in storage. I want to call attention to the fact that we have external storage, so we have to consider the performance of that external storage too. Debbie writes a query:
SELECT City, COUNT(*) FROM phonebook GROUP BY City ORDER BY City. What's going to happen is that this query goes to the driver node. A node in Databricks, or Spark, is a virtual machine (it can be a physical machine, but who uses real machines anymore), and it can have any number of cores. You'll notice the cores are also connected to disks, maybe multiple, for reading and writing, and the node has memory and other resources; so it's a virtual machine with a lot of resources, and that of course affects performance. The driver node coordinates all of the work going on: it takes that query, executes it, partitions the data as needed, distributes it over the cluster nodes, and has them all start executing in parallel. Now, we'll talk more about cores later, but a core runs a single task, which processes a single partition, at a time, so the number of cores determines the level of parallelism you get. There's a lot of stuff in here and a lot of places where this can get bottlenecked: any kind of physical reads and writes will bottleneck, and network issues can bottleneck you, so there's a lot to consider. Then the result gets sent back. And I forgot to mention (I added this in after the fact): RAM is also important, right? How much memory do you have? So the constraints I've talked about in the past are the hardware resources, the software resources, the environment, and so on, and we'll get to more of those, but the focal point here is hardware resources. That does also touch on the Databricks Runtime and its versions, but the main question is: what are we giving it for resources?

This is a deeper-dive picture of the cluster architecture, showing two worker nodes. Notice in the middle there's network traffic, and that's going to affect your speed if it's slow. You can see our cores, and something a little more detailed here: our worker VM, or node, has two cores just like before, but the memory is split. Some memory is used for storing things, like caching data you need to hold on to, and another part of the memory is used for actually doing your work. When it runs out of memory it writes to disk; that's called spilling, and it typically happens when there's a shuffle.

What's a shuffle, Bryan? Glad you asked. A shuffle is when you ask for something that forces Databricks, or Spark, to rearrange the data to accomplish your task. That pretty much happens when you do things like joins, because now it needs to collocate the two tables you're joining; it needs to collocate the keys and related data, so it moves the data around. It's a very costly operation, as they say, because it's a lot of work to move all that data from one node to another, and often, in the process, it may need to temporarily write some things to disk. Shuffles are really costly. Other things that cause shuffles are aggregations: group bys, order bys, and the like again force it to move the data around. So shuffles are something you can't always avoid, but you do have to be very mindful of them when you're doing any kind of coding on Apache Spark.

So we have our worker nodes, and then you see the driver node, which is also constrained by how much memory it has, how that memory is split between storage and working memory, and its disk; it has the same kinds of constraints.
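To make that concrete, here's a minimal sketch of Debbie's query as two notebook cells. The table name phonebook is just the example from the diagram, not a real sample dataset, so treat this as illustrative only. The GROUP BY and ORDER BY are what force the shuffle: every row for a given city has to land on the same core before it can be counted.

```python
# Cell 1: define the query -- nothing executes yet, because Spark is lazy
df_cities = spark.sql("""
    SELECT City, COUNT(*) AS cnt
    FROM phonebook          -- hypothetical table from the diagram
    GROUP BY City
    ORDER BY City
""")

# Cell 2: an action forces execution; the GROUP BY / ORDER BY trigger a shuffle,
# moving rows between worker nodes so matching cities end up in the same partition
display(df_cities)
```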
And of course our external storage is something to consider too. External storage could be something like Azure Data Lake Storage Gen2, which is a really good way to go if you're using Azure, but the faster it is (SSD versus something like HDD), the faster it can read and write and the faster your responses will be. It's an age-old axiom that reading and writing to physical hardware is slow; writing to disk has always been a bottleneck in any kind of data-centric operation.

So to review: we have worker sizes and CPU type (we'll go into more detail about what that means, but you actually have some control over the type of CPU you pick, and over how much memory and how many cores the workers have). The memory usage and allocation is something you won't control by default, but you can tweak it to squeeze out some optimization when you run your jobs. There's local disk speed, as I mentioned: how fast the disks being written to are makes a big difference. Driver configuration, like the Databricks Runtime, makes a big difference as well, along with how you configure the driver itself. And finally external disks: you want them in the same region (in Azure I don't think you can avoid having them in the same region anyway), and any way you can make them faster is obviously ideal.

Another thing to consider, which may seem a little unintuitive, is the number of nodes: it's generally better to go with larger machines and fewer nodes than lots of smaller machines. Why is that, Bryan? Why wouldn't I want a thousand nodes? Well, the deal is this: they have to communicate. A cluster works together, and when you do a shuffle, for instance, it has to start sending data from one virtual machine to another, and that takes a lot of network traffic. Having more separate machines means more data movement between them, and that creates a bottleneck. So minimizing the number of nodes is probably a good idea, and I've seen recommendations to that effect on Databricks blogs.

So let's talk a little more about the hardware, starting with CPU and memory. You have CPU and GPU types, and which one you use depends on your workload. If you're doing typical data transformation and movement, you'll probably stick with CPU. If you have heavy-duty machine learning going on, there's a good chance you want a GPU, because GPUs are optimized for heavy-duty mathematical calculations, which is exactly what machine learning does. What about the number of cores, Bryan? To summarize again: a core runs one task, which processes one partition, at a time. In other words, it tells you what degree of parallelism you can get. It's a really important item, and you generally don't want to leave cores wasted either: if you've got 300 cores out there but you're only using five, that's not a good idea, and sometimes, as the video I mentioned explains, people do exactly that and end up not leveraging all the cores they could. Let's talk about memory. You need enough memory to support your workload, and it's pretty easy to come up short.
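Before moving on, here's a small, hedged way to peek at these knobs (cores, shuffle partitions, and the memory split) from a notebook. The config names are standard Spark settings; the fallback values shown are the open-source Spark defaults, and Databricks, especially with Photon, may manage or override some of them, so treat this as a way to inspect your own cluster rather than as gospel numbers.

```python
# Roughly how many tasks can run at once across the cluster (cores you can keep busy)
print("default parallelism:", spark.sparkContext.defaultParallelism)

# How many partitions a shuffle produces; if this is far below your total core count,
# cores sit idle during wide transformations
print("shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions", "200"))

# The executor heap, and how Spark splits it between execution (working) memory
# and storage (cache) memory
print("executor memory:", spark.conf.get("spark.executor.memory", "cluster default"))
print("spark.memory.fraction:", spark.conf.get("spark.memory.fraction", "0.6"))
print("spark.memory.storageFraction:", spark.conf.get("spark.memory.storageFraction", "0.5"))
```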
I'm not going to get into all the details of how the memory is used internally, but if it's storing and caching a lot, holding a lot in memory, and you're also doing a lot of crunching within that memory, you can run out; then you get hit with spills and can potentially even crash the cluster. So you want to get that right; you want to avoid spills, that's always a good idea, and as I mentioned, the memory is split between working and storage.

Input and output: the network. There are a few things about networks, not all of which you can really control, but it's good to be aware of them. Latency is a big one: local storage versus cross-regional storage (which I don't think you can even do here, though maybe on another cloud you could) versus on-prem, which requires on-prem resources sending data into Azure and adds overhead and latency. Bandwidth: how fast is all of the hardware? The weakest link anywhere in the chain slows things down. And if you're using external resources like Cosmos DB, Azure SQL, or Synapse, you need to consider the latency those services have inherently. Cosmos DB is supposed to be super low latency, so you should get really good responses; Azure SQL may not be as good that way, and by default retrieval from it tends to be single-threaded. There are techniques you can use to get parallelism even when pulling data out of something like Azure SQL, and you can also push down some of your queries so that filtering and as much work as possible happens on the Azure SQL side before the data is sent into Databricks; see the sketch below.
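Here's a hedged sketch of both ideas against Azure SQL using Spark's standard JDBC reader. The server, database, table, column names, and bounds are all made up for illustration, and in practice you'd pull the credentials from a secret scope rather than hard-coding them.

```python
# Reading from Azure SQL with (a) a pushed-down subquery, so filtering happens on the
# SQL side, and (b) partitioned reads, so Spark pulls the data over several connections
# in parallel instead of one.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # hypothetical

pushdown_query = "(SELECT CustomerID, Region, SalesAmount FROM dbo.Sales WHERE SalesYear = 2023) AS src"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", pushdown_query)        # the subquery runs on Azure SQL
      .option("user", "sql_user")               # use dbutils.secrets in real code
      .option("password", "sql_password")
      .option("partitionColumn", "CustomerID")  # numeric column Spark can split on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")             # 8 parallel connections instead of 1
      .load())
```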
So let's look at storage. We talked about this, but basically you've got HDD and SSD, and SSD is going to be faster; you want the fastest storage possible. You also want locality: the closer it is, the less networking it needs to do, the better. As far as I/O, try to avoid spills and, in general, any time Spark has to read and write to disk. Sometimes you really do need to write something out, but be mindful that whenever any data-centric system writes to disk, things slow down a lot.

Another thing to be mindful of is which Databricks Runtime version you're using, because the version often determines what features you can take advantage of. I believe Databricks Runtime 9.1 LTS or higher is required to use Photon, for example; we'll talk about Photon, but it's a very highly optimized way to get fast execution, and you need a certain runtime version for it. Other things are affected by the runtime version as well.

So, optimizing the cluster configuration. This screen shows the Azure Databricks workspace: I went to the Compute screen, said I want to create a new cluster, and this is the screen you get. I'm going to walk through it, so fear not, and we'll cover all of these options. This next slide is an extraction and summarization of the blog linked at the bottom of the screen: based on the use case or workload type, Azure Databricks recommends the type of compute you should create.

Starting with analysis, which is generally interactive development, they recommend a single node with high memory and a lot of cores. Why is that, Bryan? Because you're probably not processing petabytes of data interactively, and if you can avoid a lot of work going between nodes, that's probably a good idea, especially since analysis usually means lots of joins and lots of aggregations, and requiring those shuffles would slow things down and cost you more money. I think that's the reasoning behind it. You may need to go to multiple nodes, but I think it's a good starting point, because you can pick some pretty large VMs behind a single node and work with that. They also point out that you're likely to be reading the same data repeatedly, so the recommended node types are storage optimized with the disk cache enabled.

What about basic ETL? The way it's defined there is by whether or not you have wide transformations. What's a wide transformation? Honestly I don't love the term, because I don't think it says much, but it basically means this: do you need a shuffle? That's it. A narrow transformation does not require a shuffle; it doesn't require a sort or a grouping or a join. That's the bottom line. So an ETL job with no requirement for a shuffle (no wide transformations) should use compute-optimized nodes. If you have more complex ETL, meaning you require shuffles (you're doing a join, say), then you still use compute optimized, but with fewer nodes, and as I said before, the reason for fewer nodes is that when it does the shuffle it has fewer machines to pass data around to, which should improve performance and save you some money.

For machine learning training, experimentation, and development, they recommend a single node with high memory and lots of cores, the same as for analysis. For machine learning in production, again minimal worker nodes, storage optimized with disk caching enabled, or maybe you'll want GPUs. GPU cores are great, but I called this out of the notes just so you're aware: GPU instances lack support for disk caching, which could also impact performance.

A couple of options, too, which aren't directly related to performance but are good to know since we're covering creating clusters. You can save a lot of money if you use something called spot instances. What's a spot instance? Think of it like the bargain basement: Microsoft is essentially saying, if you grab one of those off that shelf over there, we don't have to do anything special for you, so we'll give you a deal. Under "recommendation" I took that part of the cluster definition screen so you can see it's just a checkbox: use spot instances. I don't believe spot instances carry the same SLAs as non-spot instances, and they're probably not best suited for critical production workloads, but wherever you can use them, and you can accept somewhat less reliability and less of a guarantee that you'll actually get the virtual machine, you can save some money, and that's a good reason to use them.

The next thing to consider is serverless compute. This is a very hot topic these days in Databricks, and the nice thing about serverless compute is that it's like having clusters sitting there waiting to be used without you paying for them to idle; that's the idea. Think of cars in a parking lot, all warmed up, and all you have to do is say "I need one, I need two," and they're ready to go. Now, Databricks is actually the one running those and holding on to them for you, not Azure specifically; they're there on standby, and Databricks doesn't charge you for that (at least the documentation says so), although there may be some cost through Azure. So why would I use serverless compute, Bryan? To save time, not so much raw performance. Normally you have to wait for clusters to start up, and sometimes you don't want to: it can take five to seven minutes to bring a cluster up.
Say it's just a ten-minute job: you waited seven minutes, then ten minutes for the job to run, and then the cluster disappears. With serverless compute you get nearly instantaneous clusters available to you.

And finally, an option I mentioned: Photon. It's just a checkbox on the screen (we'll see it in a minute), but it provides vastly faster processing. What does Photon do, Bryan? Photon is really a rewrite of the Spark engine; that's how I see it. They started out with Spark and said, we're going to use Scala, which means we're going to use the Java virtual machine, and they built it all out, and it's great, it scales really well, and Java is a really great language for building something you can control like this. But once you really want speed, Java wasn't designed to be that kind of fast. The fastest reasonable language to use is something like C or C++, and that's exactly what Photon is. Photon blows past the issues and limitations that come with running JVMs on the cluster nodes and instead replaces most (but not all, at this point) of the engine with native code libraries. The cool thing is that it still works behind the same Spark APIs; in other words, your code doesn't really know it's using Photon; it's masked over it. The reason for that is so they can gradually replace parts of the engine without breaking things, at least that's my understanding of how and why this works. You want that super high-powered engine, but you don't want to break everybody's workloads in the meantime, and you certainly don't want to force people to rewrite their code. So the idea with Photon is you don't rewrite anything: Photon uses its enhanced, super-fast libraries when it can, and when it can't it falls back to the regular Spark path, but it's not relying on the JVM nearly as much as before, and it's dramatically faster.

So let's talk a little about access mode, because that's another thing that doesn't come up all that much. I'm kind of skipping over multi-node versus single node: multi-node is mainly for production workloads, where you have a lot of data and need multiple nodes, and you have to pick one or the other. But you also have this idea of access mode. You can see it where I've highlighted it with the red square; in my case it would show my account name. When I pick single user, I get an option under Advanced Options (you have to expand that tab to see it): a checkbox to enable credential passthrough for user-level data access. What that means is, if I have Blob Storage or ADLS Gen2 storage, or some other resource I need to connect to from my Databricks workspace, I can authenticate to it by having Databricks automatically pass my credentials through, so it knows who I am: it says, oh, I know Bryan, you're allowed to use this. That's really convenient, especially in development mode: you're in a developer workspace trying to get things working, and you don't have to go grab your Key Vault keys and play with all those settings and secret scopes; you can just pass your credentials straight through and get what you need. A rough sketch of the difference is below.
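Here's a minimal, hedged sketch of what that difference looks like in a notebook. The storage account, container, path, secret scope, and key names are all hypothetical; the configuration key pattern and the dbutils.secrets call are the standard ones, but check the Azure Databricks docs for the exact setup your workspace needs.

```python
# With credential passthrough enabled on a single-user cluster, your Azure AD identity
# is passed through, so you can read ADLS Gen2 directly -- no keys in code.
df = spark.read.format("delta").load(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/sales"  # hypothetical path
)

# Without passthrough, you'd typically fetch a storage key from a secret scope
# and set it in the Spark conf before reading the same path.
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")  # hypothetical names
spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net", storage_key)
df = spark.read.format("delta").load(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/sales"
)
```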
Credential passthrough is primarily, I think, more of a development-phase feature. One thing to mention, though: because it's single user, if you want a multi-user cluster you cannot use passthrough, because it wouldn't know whose credentials to pass; in this case it's just me, and it knows I'm the only one using the cluster. A single-user cluster is pretty common for development anyway, because I can spin up my clusters and tear them down without affecting other people. But sometimes people want to share a cluster. If you pick Shared (with isolation), which I believe is the second option, it keeps everybody nicely organized and separate, but it's limited to Python and SQL. The third option, No Isolation Shared, supports all languages.

As we mentioned before, this is the cluster creation screen. You can see the checkbox for Use Photon Acceleration: check it and you'll notice the runtime summary in the right corner adds the word Photon. There aren't many reasons I can think of not to just check it and use it, unless you're like me, trying to get something to crash for a demo. I'd say just turn it on and play with it, because it's really cool and it's amazing how fast it makes things run.

Let's talk about the worker types. You can see this is a standard, very vanilla kind of VM; the DS3, I believe, maps to the VM types Microsoft offers, so it's essentially the size and capability of the VM, and you can see 14 GB of total memory for that VM and 4 cores. I believe that's the lowest, cheapest VM you can get, and not the most powerful for running a workload.

You can also set what's called autoscaling, at the bottom here, right underneath the driver type: Enable autoscaling. When you do that, you set Min Workers and Max Workers. Min Workers is what it starts with when it creates the cluster (say two, in this case), and if the workload demands more resources it keeps adding nodes until it reaches the maximum, in this case eight. So it's a range. What's good about that, Bryan? Money. It saves you money: if it can get by with two nodes, it won't create more and you don't pay for more, but if it starts to need more nodes it adds them, uses them for the duration of the load that needs them, and then, after a timeout period once the need goes away, it figures out it doesn't need them anymore and scales the workers back down. So if you're going to do multi-node, I think enabling autoscaling is a good feature to use in many cases.

Now, that was the worker types; next, the driver. You always have exactly one driver, so we also have to pick a VM to be the driver, and you can see it automatically defaults to the same type as the worker. There are actually good reasons why you might want to make the driver bigger than the worker nodes: if you're going to run a lot of non-scaled-out code, for instance using Python libraries without finding the equivalent Spark version; if you're crunching through a lot of data and doing a lot of collects back to the driver; maybe you're doing a lot of JSON work where you're exploding things, bringing data local, and doing a lot of that kind of thing; maybe some sort of iteration that's just a local function, or training local models. All good reasons why you might want a stronger, more powerful driver. The sketch below pulls these knobs together.
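For reference, here's roughly what the same choices look like expressed as a cluster definition for the Databricks REST API (for example the new_cluster block of a job). The field names follow the Clusters API as I understand it, and the specific runtime version, node types, and values are just examples, not recommendations, so adjust them to whatever your workspace offers.

```python
# A hedged sketch of a cluster spec: Photon, autoscaling workers, a bigger driver,
# spot workers with fallback, and a cluster-level Spark config applied at boot.
cluster_spec = {
    "spark_version": "13.3.x-scala2.12",           # an LTS runtime (example)
    "runtime_engine": "PHOTON",                    # the "Use Photon Acceleration" checkbox
    "node_type_id": "Standard_DS3_v2",             # small, cheap workers (14 GB / 4 cores)
    "driver_node_type_id": "Standard_DS4_v2",      # beefier driver for collects / local work
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        "first_on_demand": 1,                      # keep the first node (driver) on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE" # spot workers, fall back if evicted
    },
    "spark_conf": {
        # cluster-level Spark configs, applied when the cluster boots (Advanced Options)
        "spark.sql.files.maxPartitionBytes": "134217728"
    },
}
```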
By the way, when you see LTS next to things, it means long-term support. The general rule of thumb is that production workloads should use the latest LTS when you're building something new, and if you're doing development, Databricks actually recommends using the latest versions, so you're testing your code against what will eventually become the next LTS. You don't want to test against an old LTS and find out later that something you're doing won't work in a future release; at least, that's my understanding of why they say that.

We can see this slide about getting started with Photon, and the main takeaway is that it requires Databricks Runtime 9.1 LTS or above. Also, not all instance types for the driver and worker nodes are supported with Photon, and Photon is available only on Databricks; in other words, if you're using open-source Spark you will not have Photon. Hopefully they'll eventually migrate it there.

Another thing I want to call out: when you're creating your nodes, you have the option of picking Delta Cache Accelerated node types (this has since been renamed to disk cache). The bottom line is that with this enabled, data it decides is appropriate gets cached automatically, and that improves reading of Parquet files, which includes Delta files. So if you have Delta tables and a data lakehouse going on, the disk cache will help you, and the nice thing is you don't have to specify what to cache; it does it automatically. You always have the option, with or without that turned on, to explicitly cache things in your code.

So let's take a look at the runtime we want to pick for our workload. Standard is on the left (it's kind of like a folder structure) and ML next to it: Standard would be your typical ETL/ELT type of workload, and ML obviously supports machine learning. The biggest difference I noticed between the two is that on the machine learning side you have the GPU option, which is highly efficient (expensive, mind you, GPUs are not cheap), but will give you really fast processing for machine learning training. You can also see, next to each Databricks Runtime, which version of Scala and Spark you'll be getting, and that can make a big difference too. Remember I mentioned, what can I do in terms of performance? Well, it might be here: they all say Spark 3.5, but suppose a future Spark 4.0 has all these cool optimization and enhancement features for performance; you won't be able to take advantage of those on an older runtime. So there's a trade-off, and once you pick a runtime and it's running in production, you're probably going to be careful about swapping it out; you'll have to test things carefully.

Picking the worker type: you can see this is about the lowest end possible, Standard DS3 v2, which is typically what I start with because I'm playing around in development mode, with the disk cache enabled. But if you look at the bottom, the options range from 16 GB all the way up to 192 GB of memory, and 4 cores versus 48, and that's just in this small slice of the list; you have a lot of options for how powerful your node is, and that makes a big difference in how much parallelism it can achieve and how much work it can do. Larger nodes mean fewer nodes required. The driver, as I mentioned, is one you may want to go bigger on.
You can see there are lots of options for the driver as well, and again, make it larger if you want to do more local work without having to scale out across your cluster. I'm going to show you this in a minute, but under Advanced Options you can set Spark configurations, many of which can help you optimize your workloads. You can see something like max partition bytes being set here; it's just a dummy example they put in, but these are the kinds of settings you can use to improve performance. I'm not going to go deep on that here, but if you need to do things like that, you can do it this way, and when the cluster boots it automatically applies those settings.

Finally (yay!) we're getting to shuffles and spills. Now, I'm showing this because I had to fudge Databricks to create a spill. I had what I thought was a fair amount of data relative to the memory, and no matter what I did it wouldn't spill, and that's kudos to Databricks: darn that Photon, it does not want to spill, it seems to get through anything. And of course you also have adaptive query execution, which means when it sees it's running into a problem, it figures it out and fixes the plan before it runs. So I had to force this, or I'd have spent a fortune trying to trigger a spill. What I did is, in the Spark config (as I mentioned), I set the option spark.executor.memory to 1g. My machine had 14 GB; this brought it down to one, and I was able to force a spill.

Now, in order to see the spill (and this is the whole point), you need to go into the Spark UI, and it will show you that it's spilling data. I ran this query, and I couldn't fit all of it on the screen, but in the first cell I created a DataFrame by calling spark.sql, and of course I want to trigger a shuffle, so I'm using the built-in TPC-H dataset: I'm joining line items to orders to customers, I'm doing a group by, and I'm using left joins because left joins typically carry far more overhead than inner joins. That returns a Spark DataFrame, and as you know, if I don't do anything after that, Spark won't actually execute anything; it just says, yeah, I'm ready to go whenever you are. You need to force it to retrieve data, so the second cell, on the right, calls display() on that DataFrame; that's essentially a kind of collect, it has to execute the code, it's an action, and it runs my job.

Here's where it shows me what happened. If I expand underneath the cell, you'll see the Spark jobs this query creates; you can see it's doing a lot of different work, lots of jobs and stages. We can click the View link to drill into one. In my case I wasn't sure where it might spill, so I had to poke around a little, and I'll show you how I figured out where it was spilling. When I got into one of the jobs (job 21 in this case), I scrolled down to the stage table below, the one with the stage ID, pool name, and description columns, and I clicked on the description link (one of those cryptic generated names) to see what was behind it. I knew to look at that stage because on the right side I could see the shuffle read and shuffle write columns: 150.2 MB read and 81.8 MB written, and that's what told me, hmm, maybe it spilled here. And the main point is that it shows me spill (memory) of 8,116 MB and spill (disk) of 75.3 MB, so it had to write to disk and use extra memory in the process of spilling.
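If you want to reproduce something like this yourself, here's a hedged sketch. The cluster-level setting is the one described above (spark.executor.memory 1g, entered under Advanced Options, Spark config, so it applies at boot). The query is only an approximation of the one on screen: the exact catalog/schema path for the built-in TPC-H sample tables varies by workspace (samples.tpch is a common location), so adjust the table names to whatever your environment exposes.

```python
# Cluster Spark config (Advanced Options -> Spark), applied at cluster start:
#   spark.executor.memory 1g    <- deliberately starve the executors to force a spill

# Cell 1: build a shuffle-heavy DataFrame -- left joins plus a group by.
# Table names assume the TPC-H sample data lives under samples.tpch; adjust as needed.
df_sales = spark.sql("""
    SELECT
        c.c_mktsegment,
        o.o_orderpriority,
        COUNT(*)               AS order_lines,
        SUM(l.l_extendedprice) AS revenue
    FROM samples.tpch.lineitem AS l
    LEFT JOIN samples.tpch.orders   AS o ON l.l_orderkey = o.o_orderkey
    LEFT JOIN samples.tpch.customer AS c ON o.o_custkey  = c.c_custkey
    GROUP BY c.c_mktsegment, o.o_orderpriority
""")

# Cell 2: the action that actually runs the job; expand "Spark Jobs" under the cell,
# click View on a job, and check the stage table for Shuffle Read/Write and Spill.
display(df_sales)
```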
Spills generally degrade performance. They won't necessarily cause real problems; in this case it was still small enough that it was handled easily, but it's something to keep an eye on, and I believe when you start to see spills it's a good time to consider increasing the memory available on the nodes you're using, so they can do more without spilling. That should improve performance in many cases. By the way, notice the blue boxes below say things like AQE Shuffle Read: AQE means adaptive query execution, and that's where it's analyzing and redoing its own plan to get the optimal outcome. Pretty cool.

So, wrapping up. We talked about why compute resources matter, and I hope that's clear: this is the foundation everything else rests on, so you've got to get it right. You have lots of options, and it's easy to adjust, but you need to monitor it and make sure it stays right. We talked about Standard versus Premium, and honestly I'll shortcut it: just go with Premium. We discussed the Databricks and Apache Spark cluster architecture; we had that neat diagram with Debbie submitting the query and kicking off the cluster and all that good stuff, and then we looked at all the hardware under it and how it affects the performance of your work. Then I walked through, step by step, the different parts of creating your cluster and the options you pick to make it optimal for your workload. Then I did a bit of a deep dive on shuffles and spills, because shuffles are probably the most intense part of Spark's work, meaning they'll most likely be the thing that causes you the most performance issues, so you want to look at them, and spills, because spills happen when shuffles can't do all of their work in memory, so they have to spill some data, and any time you're writing to physical hardware like disks, it slows things down.

That's it for this time. I want to thank you for watching. Please like, share, and subscribe, and until next time: I'm pulling for you, we're all in this together. Thank you.