Transcript for:
High Performance Joins: Hash Joins

[Music] So this is going to be the first of three lectures where we talk about how to do high-performance joins. This first one is hash joins, and the spoiler is that this is what you're going to want to use in almost every case. The next class after spring break will be sort-merge joins, and then we'll finish up with multi-way joins, which is a special case.

All right, for the class project stuff: I posted on Piazza last week about project two; for project three I still need to upload the info page, and I will do that tonight. For project two, if you haven't signed up for a database system yet, please do that. I haven't gone through and approved the sign-ups; I'll do that tonight as well, and I'll also put out a list of the ones that I think are interesting. Yes, question? [Student asks what the proposal due Wednesday is supposed to be.] It's what I talked about last class: what you're going to do, how you're going to do it, how you're going to test it. It's super high level, nothing deep. It's not a three-page paper; it's literally slides, for five minutes, and you can fake it by trying to get your laptop to work for two minutes. Also, I saw people signed up who are still looking for teams. I've talked to some of you, and I'll reach out to individuals again tonight so we can start coalescing everyone into groups and assigning projects in time. I'll be on campus tomorrow; I think there's a faculty candidate in the morning, but I'll be around, so if you need to talk we can try to make arrangements. Okay, any questions about project three? And project one was due last night. Some of you got 110, congrats. For those of you who didn't finish, please do finish as soon as possible, or reach out to me if something else is going on.

All right, so today's class we're going to focus on hash joins. We're going to begin with the background of what we need to care about when we implement not just hash joins but any join algorithm, really any operator in a database system. Then we'll focus on how to do the parallel hash join and define the building blocks, the phases, of doing the hash join. Then we'll talk about how to build the hash table: what the hash functions look like, what the hashing schemes look like. And then we'll finish up with the evaluation, really only two graphs from the paper you read, which was a deep analysis across a bunch of the different hash join algorithms proposed over the last decade, trying to understand what makes one better than another, because the prior papers had contradicting results, sometimes saying one implementation was better than another. The paper you read was trying to clear the air: within a single platform, a single system, a single implementation, which hash join approach is going to be the best? And the spoiler is that the no-partitioning one is going to be the best. I'm going to teach you radix partitioning and other ways to optimize the partitioning phase, but in general it turns out no partitioning is probably the best thing to do, because getting the tuning right for it,
what are the you know the the number of passes and the the size of the bits you want to look at in the Radix all getting all that right per Hardware is is non-trivial so most systems as far as I know are just going to implement the new partitioning approach okay so today's lecture is about how to joins on on a single node so I don't have a slide discussing this but think about again we're focused on single node execution at this point in the semester so be what we are aware that we could potentially running in a distributed system where there is some higher level orchestration that is moving data around between nodes as needed so today's class is really like okay well somebody else has got the data to my node and I'm going to do the join on on the data that's local to me so we're not going to worry about like if I have to go pull things over the network uh if someone's already done the shuffling for me or did a broadcast join all that we ignore it's like the data's on my local box I want to run the join to produce output and then where it goes next at this point we don't care all right so again and also emphasizing that we're only looking at how to do joins between two relations or two tables uh using multiple threads to paralyze their operations we'll look how to do multi-way joints or like three or more table joins uh after spring break so to do a join there's basically two approaches there isn't any magic to this it's going to be either hash join or assortment's join and the basic idea is that we want to avoid having to do Brute Force sequential scans over our our two input tables or relations over and over again so we either build a temporary data structure like a hash table to help us do quick lookups to find matches or restart the sort the data and then just walk them uh with iterators and checking across right so we're not going to discuss nested looped ones because in an olap system you almost never want to do this because the data you're trying to join is not is going to be large and chances are you're probably not going to have an index already already in it right if you can't decorate it yeah yes you have to do Nest Loop join we can order that for now we'll come to that later um yeah what yes what we'll talk about nested queries I think we talk query optimization later on but best case scenario has join so um for LGB systems a lot of them don't actually even Implement hash joints like MySQL I think maybe MySQL 8 has it before yeah I think that maybe it just added but for a long time they didn't have it right because in most of the time O2 workloads index NASA Loop joint is what you're going to want to use and I just want to iterate to say like this is the index nested Loop join is conceptually the same thing as the hash join except traditionally when you say I'm doing an index NASA Loop join it's going to be on a B plus 3 index but they're basically doing the same thing there's some data structure that allows us to do quick lookups in the case of the B plus tree it'll be log n if it's the hash table average case of one right there's some data structure that's going to always have to do sequential scan over the entire data set to quickly find a matching table to do our join but in the case of Vanessa Loop join the index is already going to exist and then when the query is over the index still exists and then and the hash joins we're talking about here I'm going to spend all the time building this hash table probe it to do my join and then throw it away immediately after my query 
is done; then the next query comes along and I build it again, and that's okay. There is some research literature about saving these hash table data structures from one query to the next; it's sort of like a materialized view, but not exactly the same. For our purposes here, assume we don't have an existing hash table and we have to build it on the fly. Some systems will build a B+tree on the fly instead; I think SQL Server calls this the index spool. They build a B+tree instead of a hash table because, for whatever the query is, that's better. But not everyone does that.

All right, so hashing versus sorting is a classic debate in databases that goes back to the 1970s. We'll see a similar debate when we talk about query optimizers: do I go top-down or bottom-up? There are pros and cons to both approaches. Back in the 1970s, the machines people were running on were really limited and had small amounts of memory, so they had to do external merge sort, and the literature at the time deemed that sort-merge joins were better than hash joins, because at least you get sequential access instead of the random access you have with a hash table. Then in the 1980s hash joins became the preferred approach, and I think it was a combination of two reasons. There was the Grace hash join technique that came out of Japan, which showed how to spill to disk if necessary by partitioning, which is basically what the radix partitioning approach we'll talk about later does. And there was also this movement toward specialized database accelerators, called database machines: specialized hardware people were building in the 1980s that could do hash joins in hardware. Of course they all failed, because of Moore's Law: Intel CPUs and Motorola CPUs were getting better and better at the time, and by the time you actually fabbed your database machine hardware and got it to market, the next version of x86 was out, and whatever benefit you had was gone. So hashing was deemed the winner in the '80s. Then in the 1990s there was a paper by Goetz Graefe, the guy who invented the Volcano approach we've mentioned a couple of times before, where he basically argued the algorithms were equivalent: there really wasn't any major performance difference between the two on the hardware available at the time. Then we get to the 2000s, and for the last two decades hashing has been the dominant approach. This is just a byproduct of the hardware getting faster, people being clever about where they store data in memory, and better implementations of these approaches. So in the 2000s hashing dominated; in the 2010s the question was whether you should partition your data before the join or not; and in the current decade, no-partitioning is deemed the better one. But it's still worth understanding what partitioning is, and what the sorting approach is as well, which we'll cover next class.

Right, so this lecture is going to focus on these things. The paper you read was, as I said, a summary of what the previous decade of hash join implementations looked like in the research literature,
trying to understand which results we should actually trust when building a new system, because oftentimes they said contradictory things. The modern era of hash joins kicked off in 2009 with a paper from Oracle and Intel, where they showed that hashing was better, using the radix-partitioned approach, although they claimed in that paper that they thought sort-merge would be better once 512-bit SIMD registers came along. Of course, AVX-512 wasn't called that in 2009; they said, if we had 512-bit SIMD registers, we think sort-merge is going to win. Later papers showed that's actually not the case, but this was the first one to say, hey, hashing is better. Then there was a paper from the people at Wisconsin in 2011 where they looked at the differences between partitioned and non-partitioned hash joins; we can discuss those results later on, although it's over a decade old now. The HyPer guys came out in 2012 and said, okay, everybody's wrong: sort-merge is actually better, and even without SIMD we can get better performance in our system with sort-merge than with hash joins. That was 2012. Then a year later they came back and said, okay, ignore what we said, we were wrong: it turns out you really do want to use hashing, and if you make it NUMA-aware you can get much better performance. In 2013, the same year, the other Germans, the Swiss Germans at ETH I guess, came out and said, here's another implementation that extends what the Wisconsin guys had done, and here's how to get better performance. And then the paper you read came out in 2016 and basically said: everyone is showing different results, it's not clear what's actually going on, and it's not clear which optimizations are implemented in all these different approaches. They tried to do a thorough evaluation of all of it, and the main finding is that the non-partitioned approach turns out to be superior, both in terms of performance in the common case and in terms of implementation complexity. There's also a paper from 2015, from the same people the year before, where they looked at all the different hash table implementations; it's the companion to this work, and I think it's great stuff. And where we're at now, the state of the art, is from the new Germans, the same Germans as HyPer but now building Umbra. They have a version of the radix hash join in their new system Umbra, the successor to HyPer, which shows that the radix hash join is better, but only marginally, and the engineering cost of getting that better performance just isn't worth it: you're better off implementing the no-partitioning scheme.

Another thing I'll point out: even though these papers keep saying sometimes one approach is better than another, as far as I know, no system actually tries to implement multiple variants. They usually just pick one hash table implementation, one hash join implementation, and that's it; they stick with it. It's not worth the engineering cost to be adaptive or dynamic on the fly based on what the data looks like; there are other things you need to worry about. And we'll see later on, at the end, that these Germans show you're not
really spending that much time executing joins during query execution anyway; there are other things to worry about.

All right, so our goals for building a high-performance join algorithm are the following. And these goals matter whether or not we're building what's called a hardware-conscious algorithm or a hardware-oblivious algorithm. That distinction just means: does our implementation of the hash join algorithm have parameters or knobs that we can tweak based on what we know the underlying hardware looks like, such as the cache line size, the memory access speed, the L1/L2/L3 cache sizes, and so forth? Hardware-oblivious means you build your algorithm without any knowledge of these things; hardware-conscious means you are aware of them in the implementation. The two goals below matter either way; it's just a question of whether the implementation spends the extra effort of being hardware-conscious to push them further than what you can get through careful programming alone.

The first goal, for our hash join and actually for both types of joins, is that we want to minimize the amount of synchronization occurring between different worker threads running on the same box. That means minimizing how often one worker has to coordinate with another about who's going to write into some space, or wait for data that another worker needs to generate. That doesn't mean we need to make our entire algorithm latch-free, because that doesn't always mean better performance. We just need to be smart about where we take latches, to make sure we're not bottlenecked on everyone writing to a single location.

The next goal is to minimize the cost of accessing memory. As I said, assume we're running on a single node and the data we want to join is already in memory; we've already fetched it, we've already done the sequential scan in the earlier parts of the query plan, everything's in memory. So now, when we access data, we want to maximize locality. Ideally a worker accesses data that's already in its CPU cache, or at the very least in the same NUMA region. And when we bring data into our CPU caches, we want to maximize the amount of work we get out of it before moving on to the next piece of data. We don't want to ping-pong: bring data in, do something with it, throw it away, and then bring it back in a few moments later, because now we're polluting our cache and spending a lot of time in memory stalls.

So let's dive a little deeper into this one. In our algorithm implementation we need to be mindful of what's in our CPU caches, what the capacity of our CPU caches is, and we also need to be careful about what's in our TLB. What we don't want is a thread reading random data and, as I
said, ping-ponging back and forth, where not only do we get cache misses and stall while going out to memory, but we could also get misses in our TLB because the entry we need is not there. For example, if I need to touch 10 tuples and those 10 tuples are on 10 different pages, but my TLB only holds five entries, I don't want to cycle through, bringing an entry in, evicting a victim from my TLB, then bringing another one in, because now I'm paying two misses: one for the TLB and one for the data itself.

There are two ways to deal with this. First, we want to maximize sequential access. This means we try to cluster the data we want to access, for whatever phase of the join we're in, within a single cache line: bring those 64 bytes in, do whatever we need to do with them, and move on. And one optimization we can do, if we have to write data out to an output buffer, is use streaming instructions that bypass the CPU caches and go immediately out to DRAM, because we know we're not going to read what we just wrote. Second, for random access, we can split things up so that the chunk of data each operation or step works on fits within a single cache line. There's going to be a trade-off between the number of instructions we execute to compute the join and the amount of memory we touch. It may be the case that the extra CPU instructions needed to partition the data don't pay for themselves: the improvement isn't worth the cost of spending that time versus just blindly accessing the data. And that's why no-partitioning is going to be better. So while all these things matter, the question is: is it worth the time to do the partitioning steps to maximize these goals, or is it better to just read the data and be a little more careful about what chunks we access as we go along? Again, we'll see that when we compare the approaches.

All right, so the hash join is probably the most important operator you can have in an OLAP system. But again, it's like the SIMD stuff, where if you take a really small portion of the system and vectorize the hell out of it, you get amazing-looking improvement numbers, but it's all the stuff around the operator, all the overhead of getting data in and out of that small kernel, that ends up dominating the cost. So while it's important, and every system needs to do this as fast as possible, there's a bunch of other stuff that can matter a lot more. I will say, and I'll show this at the end, that the Germans claim that for some TPC-H queries you're only spending maybe 10 to 20 percent of the time in the hash join within the overall query. We have numbers from Impala from a long time ago, and maybe it's because it was a long time ago, showing their system spent 45 percent of the time doing hash joins across all of TPC-H. So I don't actually know what the real number is. I think Rashad measured this; I could go back and look. I think it's somewhere in
between, and I'm more inclined to go with the more recent German numbers than the Impala numbers, but we'll come back to this. Regardless, it's still an important algorithm that we need to run as fast as possible, and we want to take advantage of all the additional cores we're getting, because Intel is not going to ratchet up the clock speed much more; there are just going to be more and more cores. And this is all independent of whether we're using SIMD or not.

So the basic steps of the hash join are the following. There are three phases, but the first one is optional. The first is the partition phase, where we scan R and S and split them into disjoint subsets, shards, partitions, based on some hash function applied to the join key. The idea is to break the data into smaller groups so that in the subsequent phases, the build and the probe, worker threads can operate on discrete chunks with all of that data local to them. Again, the partition phase is optional; you don't have to do it. You'll sometimes see this called the Grace hash join; the partition phase comes from that. Next is the build phase: we scan R, the outer table, and create a hash table on the join key; we'll talk about what that hash table and those hash functions can look like. Then, in the probe phase, for each tuple of the inner table we look at its join key, hash it, probe into the hash table we built in the second phase, and check whether there's a match. If yes, we combine the two tuples together and produce the output, then move on to the next one, or, in a pipelined plan, pass it up to the next operator in the pipeline.

The big thing about the paper from these Germans is that they include the materialization cost in this last phase, the probe phase, where you combine the two tables and produce the output. Some of the papers they cite don't do this: they find a match and then immediately discard everything. But obviously that's not realistic, so the authors always include it in the execution cost, because it's certainly going to affect the caching behavior in the system.

I'm going to go through each of these phases one by one so we understand the implications, the pros and cons of the design choices we have for the implementation. For the partition phase, the two approaches are basically what are called implicit partitioning and explicit partitioning. With implicit partitioning, we assume somebody else in the system, say whoever loaded the data from disk, has already partitioned it on exactly the join key we want, and therefore we don't need any additional partitioning pass; the data is already in the right locations for us. This is usually not the case. You can define partition keys, but think about bringing things off disk into memory from Parquet files: it could be just a bunch of random stuff. Maybe within one Parquet file things are partitioned, but if I'm reading multiple Parquet files, it's all going to be a mismatch.
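To make the build and probe phases concrete, here's a minimal single-threaded sketch in C++. This is my own illustration, not the paper's code: a real system uses a custom hash table rather than std::unordered_multimap, and the Tuple layout here is invented.

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Tuple {
        uint64_t key;      // join key
        uint64_t payload;  // stand-in for the rest of the tuple
    };

    std::vector<std::pair<Tuple, Tuple>> HashJoin(const std::vector<Tuple>& R,
                                                  const std::vector<Tuple>& S) {
        // Build phase: scan the outer relation R, hash table keyed on join key.
        std::unordered_multimap<uint64_t, const Tuple*> ht;
        ht.reserve(R.size());              // size it up front to avoid rehashing
        for (const Tuple& r : R) ht.emplace(r.key, &r);

        // Probe phase: scan the inner relation S, look up each join key, and
        // materialize the matching output pairs (the cost the paper insists
        // on counting).
        std::vector<std::pair<Tuple, Tuple>> out;
        for (const Tuple& s : S) {
            auto [lo, hi] = ht.equal_range(s.key);
            for (auto it = lo; it != hi; ++it)
                out.emplace_back(*it->second, s);
        }
        return out;
    }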
With explicit partitioning, the idea is that we scan, in this first example, the outer relation and redistribute it among all the CPU cores. (And I don't know why this slide shows only one table; ignore that, I'm teaching you radix partitioning, and it should be both tables split across the CPU cores.) That way, when we do the build phase or the probe phase, the CPU cores operate directly on the partitions they're assigned to, and we can be careful about where the data is physically located: when we split the partitions up, we put them in different NUMA regions so that each worker operates only on its local NUMA region and we're not going over the interconnect and paying the longer latency. So: split the two input relations into partition buffers based on the join key. The big goal is that, although we pay the penalty of doing this partitioning, the extra instructions and the extra time should be less than the extra time we would spend if we didn't partition at all, just left the data where it is, and ran an unoptimized execution of the join over it. But again, the research says that doesn't always turn out to be the case, and you're often better off just not doing it. Note that if it's a row store, you have to copy the entire table, which can be expensive; if it's a column store, you only copy the data you actually need, which could be just the join key plus an offset, depending on whether you're doing early or late materialization.

All right, so the two ways to do the partitioning, if you want to do it, are non-blocking and blocking. With non-blocking (I forget whether they cover this in the assigned paper) the basic idea is that we only have to scan the relation once, and we produce output as we go along; some simple partitioning scheme determines where we do our writes, and we'll see that in a second. Because we don't need multiple passes, while we're doing the partitioning, as soon as we put a tuple into an output partition some other thread can pick it up and do whatever it wants with it: start the probe, or start the build on the hash table side. We don't have to wait for the partition phase to complete before moving on to the next one. In the blocking approach, the radix approach, we have to scan the data at least twice, because in the first pass we go through and figure out how to break up the data; in the radix partitioning case that's with this histogram thing. Once we've produced the histogram, we go back and start writing to the partitions, and only then can something else read them. That assumes we don't want to take another pass over it to partition even further, which, if you have a lot of skewed data, you may actually want to do; we'll see how to handle that in a second. Typically, when you see the literature say "I'm doing a radix join" or "a radix hash join," it means this: a hash join with this extra partitioning phase.

All right, so in the non-blocking case, the
two approaches are shared partitions or private partitions. The basic question is: where are the worker threads going to write the data they're partitioning? Are they going to write into some global partition space, or into their own private partitions? This is a good example in computer science of "no free lunch." In one case, I can just have everyone write into the global space and be done with it once everyone's finished. In the other case, I write locally, and that can be faster because there's less latch contention, since I'm not writing to a shared data structure; but then I have to coalesce everything back into a single space afterwards, and that takes extra time.

Let's look at this visually. Here's my data table, and say I'm doing the morsels approach: I split it up into chunks of some number of tuples, each assigned to a single worker thread. Say I want to do the join on column B. Each thread scans through, accesses B, hashes it with whatever my hash function is, and writes into some global set of partitions. These aren't even chained hash tables; they're just linked lists of blocks of buffers. And if I want to update these linked lists, because they're a shared data structure, meaning any worker can write into them, I have to protect them with latches: whichever thread acquires the latch appends the next entry. This is easier to implement, because everyone writes into the same space and there's no cleanup afterwards; once everyone finishes their scan over the data table, this structure has everything it needs. But I can have contention if everyone is trying to write to the same partition. Private partitions are basically the same thing, except now every single worker thread has its own local partitions to write into. They don't have to take latches, because nobody else writes to the same space they do. But when the scan is done, we have to coalesce back into the global partitions: this P1 here has data, and I need this P1 and that P1 and that P1 put together into a single P1, because otherwise I'd get false negatives. So what you do is assign one worker thread per partition to go across all the different local partitions and combine the results into a single output, and the same happens for the other partitions at the same time, all in parallel. In the end, I have a set of global partitions with all my data in them. Okay.
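Here's a sketch contrasting the two schemes (names and structure are my own, not the paper's): in the shared scheme every thread appends into one global set of partitions, so each partition needs a latch; in the private scheme each thread writes its own local partitions latch-free, and a final step coalesces them.

    #include <cstdint>
    #include <mutex>
    #include <vector>

    constexpr int kNumParts = 4;  // assumed partition count, for illustration

    struct SharedPartitions {
        std::vector<uint64_t> parts[kNumParts];
        std::mutex latches[kNumParts];

        void Append(int p, uint64_t tuple) {
            std::lock_guard<std::mutex> guard(latches[p]);  // contention point
            parts[p].push_back(tuple);
        }
    };

    struct PrivatePartitions {                // one instance per worker thread
        std::vector<uint64_t> parts[kNumParts];

        void Append(int p, uint64_t tuple) {  // thread-local: no latch needed
            parts[p].push_back(tuple);
        }
    };

    // Coalesce step for the private scheme: one worker per partition id gathers
    // that partition from every thread's local set into the global output.
    // (In practice the outer loop runs in parallel, one partition per worker.)
    void Coalesce(const std::vector<PrivatePartitions>& locals,
                  std::vector<uint64_t> (&global)[kNumParts]) {
        for (int p = 0; p < kNumParts; p++)
            for (const auto& l : locals)
                global[p].insert(global[p].end(),
                                 l.parts[p].begin(), l.parts[p].end());
    }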
All right, so now, radix partitioning. The idea here is basically the same, except in the previous examples each thread went through the data once, and I wouldn't call the final copy a pass, because that's literally just memcpy-ing into the global partitions. With radix partitioning, you do one pass over the table and compute a histogram of the number of tuples per hash key, for some radix, which we'll cover in a second. Then you use the histogram to figure out, within the global partition space, where each thread is going to write. It's a coordination step: you scan through the data, build the histogram, distribute the histogram across all the worker threads, and each thread can use it to figure out the offset within the partition space it should jump to, knowing it doesn't need to coordinate with any other thread, because no other thread will write into the same space. Then you scan through again, and the actual writing happens in this last step. And if you want to do recursive partitioning, because you have a lot of skew and one partition got a ton of data, you can loop back around and run this again.

First, let's cover what a radix is and what a prefix sum is, and then we'll see how to put them together to do those three steps. The radix of a key is just the value at some position within the key itself. Assume we have integer keys, say 19, 12, 23, 38, 41, 44. The radix at the first position, the ones digit, would be 9, 2, 3, 8, 1, and 4, and likewise I could do the same thing for the second position. So what we're going to do is compute the radix, which is just bit shifting and masking, for some position of the hash key, and from that compute a histogram saying, for every radix value, how many elements of the input sequence have it. Then we use the prefix sum, which we'll show on the next slide, to determine what the write offset is going to be. The idea is that if a key shows up (say this is the hash value of it) and its radix is one, and I know there are going to be three other values with radix one that I need to write into my partition space, then knowing which worker thread I am, I know what offset I'm allowed to write into. The prefix sum is how we determine that offset. A prefix sum is just a running summation over an input sequence, where each value produced in the output sequence is the sum of the values up to that point. So for the sequence 1, 2, 3, and so on, the prefix sum of the first element is just 1; for the second element it's whatever the previous sum was plus the new value, so 3; and you do this all the way down the line. Now you can see how I use the histogram from before: going back here, if the counts are 2, 3, and 1, the prefix sums are 2, 5, and 6, and that tells each worker what offset range it can safely write into.

Actually, I found this out today: there's a paper by Guy Blelloch from around 1990 where he says, hey, wouldn't it be great if SIMD had an instruction for prefix sums, because they're really important. As far as I can tell there isn't a SIMD instruction that does exactly this; it takes a bunch of bit-shuffling to make it work, and I don't think it's performant, at least on current AVX-512.
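Here are both pieces in code, as my own sketch; the 8-bit radix width is an assumed tuning knob. One note: the 2, 5, 6 in the lecture example is the inclusive prefix sum, which marks where each partition ends; the exclusive form computed below gives each partition's starting write offset, which is what the scatter pass actually needs.

    #include <cstdint>
    #include <vector>

    constexpr uint64_t kRadixBits = 8;
    constexpr uint64_t kRadixMask = (1u << kRadixBits) - 1;

    // Radix of a hash: kRadixBits bits starting at position pass * kRadixBits.
    inline uint64_t Radix(uint64_t hash, uint64_t pass) {
        return (hash >> (pass * kRadixBits)) & kRadixMask;
    }

    // Exclusive prefix sum: offsets[i] = histogram[0] + ... + histogram[i-1].
    // For histogram {2, 3, 1} this yields starting offsets {0, 2, 5}.
    std::vector<uint64_t> PrefixSum(const std::vector<uint64_t>& histogram) {
        std::vector<uint64_t> offsets(histogram.size());
        uint64_t running = 0;
        for (size_t i = 0; i < histogram.size(); i++) {
            offsets[i] = running;
            running += histogram[i];
        }
        return offsets;
    }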
Okay, but I'm saying: Guy is awesome; he thought of this stuff 30 years ago. So let's see how to put this all together. Assume we've already divided the data up into morsels, and the worker threads scan through their input sequences. Assume the values shown are the hashes of the keys after we've already hashed them; we'd obviously hash them on the fly, but I'm showing it this way for simplicity. Each worker thread looks at the first radix and produces a histogram: what's the number of keys that have this radix value? You just scan through and produce the counts. So the histogram for this first thread is: partition zero, two elements; partition one, two as well. For the other thread: partition zero has one, and partition one has three. That's the histogram each thread computes by looking at the radix.

So now, assuming I have this one giant output buffer where I want to put all my tuples after partitioning them, I can use the prefix sum to determine where each worker thread is allowed to start writing the keys it finds when it scans through the data the second time, in the second pass. For worker thread zero, partition zero, it's the first element, so its prefix sum is zero at that point, and it writes there; then the partition-zero writes for the second worker thread start right after. And as they scan through (I'm showing two worker threads here, but imagine there were 32 of them running in parallel) they don't need to coordinate about who writes where, because they've already precomputed where they're allowed to start writing. There's no synchronization cost other than waiting for everyone to say: okay, did you compute your histogram, exchange the data, compute the prefix sums? And that's not expensive to do. It's the same thing for partition one: they can write there safely without coordinating. So then they do the second pass, scan through the data, and populate the output partitions appropriately. This is a simplified version where I'm just showing the hash values; think of it as the keys plus the actual tuples being written, not just these little numbers.

Now, what if I have a lot of skew? Take partition zero and partition one in the build and probe phases: if I just have two worker threads, I let worker zero take partition zero and worker one take partition one. In this example it's three elements versus five, but imagine it was 10 versus 10 million. Because of the size of that partition, maybe I want to partition it again, so I can just rerun the same algorithm recursively and do another two passes over that partition to divide it up even further, this time looking at the next radix digit and doing exactly what we did before. Most systems will just do one round of the two-pass algorithm; disk-based systems sometimes do more than one, because they want to divide things up further to fit page sizes on disk.
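Here's the two-pass structure for a single level, single-threaded, as my own sketch; `hashes` stands in for the hashed join keys. In the parallel version each thread builds its own histogram over its morsel, the threads exchange histograms, a prefix sum over (partition, thread) gives every thread private, conflict-free write cursors, and a barrier separates the histogram pass from the scatter pass.

    #include <cstdint>
    #include <vector>

    constexpr uint64_t kParts = 1u << 8;   // 8 radix bits: 256 partitions

    std::vector<uint64_t> RadixPartitionPass(const std::vector<uint64_t>& hashes) {
        // Pass 1: histogram, counting tuples per partition.
        std::vector<uint64_t> hist(kParts, 0);
        for (uint64_t h : hashes) hist[h & (kParts - 1)]++;

        // Exclusive prefix sum turns the counts into starting write cursors.
        std::vector<uint64_t> cursor(kParts, 0);
        for (uint64_t p = 1; p < kParts; p++)
            cursor[p] = cursor[p - 1] + hist[p - 1];

        // Pass 2: scatter. Every write lands in a precomputed slot, so
        // parallel threads would need no latches here.
        std::vector<uint64_t> out(hashes.size());
        for (uint64_t h : hashes) out[cursor[h & (kParts - 1)]++] = h;
        return out;
    }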
Again, in practice it depends on how much skew there is. Hashing does alleviate some of it, but if every tuple has the exact same join key value, then no amount of partitioning is going to help you; they're all going to hash to the same thing.

All right, any questions so far? [Student: will this end up being a sorted list?] Sorted by what, the join key? It's hashing, and hashing is random, so you don't have any sort order. It's clustered: all the tuples with radix zero are together, but there's no guarantee that within that cluster they're ordered by the join key. So it's not sorted. [Student: so within a partition the data is unsorted?] Within a partition it's unsorted, but yes, the tuples are grouped together; "clustered" is the better word, grouped together based on the partition value, the radix that we generated. [Student: when we write to the output buffers, do we need to put them in order?] You actually can't. Going back here: at this point, all worker one knows is that it writes its partition-one data here; it doesn't know what's in worker zero's data. So how could it write here and also maintain a sort order within partition one, when this whole region is partition one? You'd have to coordinate with the other worker threads, and you don't want that. [Student asks whether recursive partitioning on every digit would eventually give a global order.] If you recursed on every single radix digit, then yes, globally, by hash value. But you can't do anything with it, because you still have to look up into the hash table, and that's going to be another random access; I don't think you can take advantage of it. You still probe the hash table anyway, which is random, and then within that, depending on which hash table or hashing scheme you're using, there's a sequential scan. Unless the hash table itself is sorted, which it won't be, you don't get any benefit from this. But that's an interesting observation.

Okay. So there are two optimizations the paper points out that make a big difference. The first one is software write-combine buffers. The idea is that in my example, I showed the worker threads writing directly into this global partition space. On PowerPoint, who cares; it's not a real system. But if you were actually writing to a contiguous chunk of memory like this, it could be really bad, just doing this random access: going back here, I write to this partition, then I write to this other partition over there.
What you instead want to do is have a little buffer that's local to each worker, and write all your new updates there; when it fills up to a cache line, you write the whole thing out at once, in a batch, to the global memory space. This removes pressure on the TLB and improves cache locality for the data you're accessing. It's similar to the private-partition stuff we saw before, but you don't need that separate write phase where the worker threads scan through and combine everything afterwards; you still get the benefit of everyone writing locally, and then they write out in batches to get better performance.

The next one is the streaming-write stuff we talked about before. As I'm writing partitioned data out to my partition output buffer, I know I'm potentially not going to read it again soon; as soon as I write data out here, I know I'm not going to go back and read what I just wrote. That comes later, when I do the build or the probe of the hash join. There are instructions in x86 called streaming writes where I can write to a memory location and bypass the CPU cache, so I don't pollute it. I know you can do this with SIMD; I don't know whether you can do it for regular scalar variables. The combination of these two things, and we'll see this when they do the optimized evaluation, gives much better performance over the basic implementations. There are other optimizations they cover, like being NUMA-aware, which we've covered before, and I'll mention a few other tweaks you can do as well.
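Here's a minimal sketch of a software write-combine buffer with a streaming flush, assuming x86-64 (my own illustration; the buffer sizing is an assumption, and a real implementation also flushes partially full buffers at the end of the scan). _mm_stream_si64 is the x86 non-temporal store intrinsic, which is the scalar path the compiler exposes for these streaming writes.

    #include <cstdint>
    #include <immintrin.h>  // _mm_stream_si64: non-temporal (streaming) store

    constexpr int kSlots = 8;  // 8 x 8-byte values = one 64-byte cache line

    struct alignas(64) WCBuffer {
        long long vals[kSlots];
        int count = 0;
    };

    // Each worker keeps one WCBuffer per partition. Writes accumulate in the
    // cache-resident buffer; a full cache line is flushed to the global
    // partition area with streaming stores that bypass the CPU caches, since
    // the data won't be re-read until the build phase.
    void BufferedWrite(WCBuffer& buf, long long value,
                       long long* global_part, uint64_t& cursor) {
        buf.vals[buf.count++] = value;
        if (buf.count == kSlots) {                 // a full cache line: flush it
            for (int i = 0; i < kSlots; i++)
                _mm_stream_si64(&global_part[cursor + i], buf.vals[i]);
            cursor += kSlots;
            buf.count = 0;
        }
    }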
All right, so now we've done the partitioning; we want to do the build phase. The idea here is that we scan the outer table R, either the tuples themselves or the partitions of them, and for each tuple we find, we hash it using the same hash function we used before and write it out to some location in our hash table. And if we're organizing the hash table's data into buckets, we want those buckets to be equivalent to a few cache lines, nicely aligned, so that we're not paying the penalty of unaligned access.

A hash table is a combination of two components; when people say "I have a hash table," you need both of these things. The first is a hash function: it takes an arbitrary value, such as our join key, and maps it to some smaller domain, typically a 32- or 64-bit integer, which will look random. For a hash function we care about the trade-off between how fast it is to compute and what the collision rate is going to be; we'll see what that looks like in a second. The second piece is the hashing scheme: when we hash our join keys we're invariably going to have collisions, and the question is how we handle them in our data structure. Again there are trade-offs here: do we have a really large hash table that won't have many collisions, or a smaller one with collisions, where we pay extra instructions to find places to insert data and to find the key when we do a lookup later on?

We don't have a lot of time, so I want to go through both of these quickly, because the main takeaway is going to be: xxHash is basically the fastest hash function, and the linear-probing hash table will be the fastest implementation. But I do want to cover things like cuckoo hashing and hopscotch hashing, just so you're aware these things exist, and I'll explain why you don't want to use them, so that when you go into the real world and somebody says, hey, we should use this fancy hash table, you can just say no and point to this lecture.

All right, so for a hash function in a hash join, we don't care about cryptographic properties or guarantees; we don't care about being able to invert it, it's a one-way hash. We want something that's really fast and has a low collision rate. The fastest hash function you could have would be to always return 1; it's the simplest thing you could do, basically copying data from one register to another. But of course that has a terrible collision rate, because no matter what key you give it, it hashes to 1. The best collision rate, and the slowest, would be what's called a perfect hash function. This is a theoretical concept where no matter what key I give it, it generates a guaranteed-unique hash value. This exists in the research literature, but you can't actually build it, because the way you'd implement it is with another hash table: a hash table for your hash table. Nobody does this.

For the latest up-to-date numbers on the fastest hash functions, there's a benchmark suite called SMHasher, which I think is from the MurmurHash guy. It's basically a stress test for all the different hash functions out there, measuring collision rate versus performance. I won't spend time on this, but here's a bunch of hash functions. The modern era of hash functions started, I think, in 2008, when some rando on the internet built MurmurHash and put it out there, and people started using it because it had this nice trade-off between collision rate and performance. Google took it and forked it, and that became CityHash, I think. The guy who built Zstandard, the compression stuff we talked about before, built xxHash, which is still considered the state of the art; XXH3 is the latest one, and it performs really well. Then in 2016 there was HighwayHash, and there's also FarmHash, but again, you wouldn't use that for a hash join.

Here's a microbenchmark I ran with a different framework, measuring throughput for different key sizes, and you can see XXH3 just blows everyone away. The sawtooth pattern is the cache line sizes: as the key gets bigger and fills up the cache, I'm processing more data per miss, so I'm getting more throughput. I ran this benchmark on my desktop machine and then again on dev12, and dev12 was much slower; I have to look at what's going on there, since it's supposed to be a newer CPU. Anyway, CityHash and FarmHash don't use SIMD, and I don't think xxHash uses SIMD either; none of these commit to SIMD, because using it makes them less portable, and they're trying to be general-purpose things.
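For reference, here's what calling xxHash looks like. XXH3_64bits is part of xxHash's public C API; everything else in this snippet (the key, the table size) is my own illustration. With a power-of-two table size, the slot comes from a mask instead of an expensive modulo.

    #include <cstdint>
    #include <cstdio>
    #include "xxhash.h"  // https://github.com/Cyan4973/xxHash

    int main() {
        uint64_t join_key = 19;
        uint64_t h = XXH3_64bits(&join_key, sizeof(join_key));
        uint64_t num_slots = 1ull << 20;   // assumed table size (power of two)
        uint64_t slot = h & (num_slots - 1);
        std::printf("hash=%llu slot=%llu\n",
                    (unsigned long long)h, (unsigned long long)slot);
        return 0;
    }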
All right, so again, we're going to use xxHash in most systems, and now we pick a hashing scheme. I'll go through these quickly. Linear probing will be the most important one, the most common one; chained hashing is probably the second most common; and then the others show up when people think, oh, this looks clever, let's try it. But unless you know what you're doing, you probably shouldn't. For example, before the pandemic we had the co-founder of InfluxDB come give a talk at CMU, and he mentioned they were using Robin Hood hashing. Dave Andersen was in the audience and asked him why, and he said one of his engineers saw it on Hacker News and it seemed like a cool thing, so they implemented it. That's not a good reason; the research literature clearly shows that linear probing is the better approach.

All right, so chained hashing: this is what most people think about when you say you have a hash table, like what you get in Java. The basic idea is that there are linked lists of buckets, one chain per slot of the hash table, and to handle a collision we follow the linked list until we find a free space where we can put our key. On a lookup, we have to scan the bucket chain for the slot, and we either find the key and stop, or we reach the end and know it doesn't exist. Conceptually it's basically the same thing as a linear-probing hash table, but where the linear-probing table is one giant fixed-size array, the chained hash table can expand by growing the chains over time. So, say these are the keys I want to insert, these are my bucket pointers, and these are my buckets. For key A, I hash it, land in this bucket, find the first free slot, and write it there. For key B, hash it and write it there; C, and so forth. Now for D: when I land in this bucket, both slots are occupied, so what I do is extend the chain for this location, create a new block of memory, and write my entry for D there. And I can do the same thing for E, F, and so on.

One optimization that HyPer does, which I think is clever because they use a chained hash table, is that for the pointers they store in the slot directory, as well as the pointers between buckets, they store the memory address, which is actually only 48 bits on x86, and then use the remaining 16 bits for a bloom filter that tells you whether the key you want is in that chain. Does everyone understand this? When you allocate a 64-bit pointer on x86, it does take 64 bits, but the hardware only uses the first 48 bits for addressing. You can't really have 2^64 memory locations; it's 2^48. And the reason Intel did this is that they knew nobody actually needed the full 64 bits, but instead of making weird 48-bit registers they just made them 64 bits and had the hardware address only 48 of them. So when you allocate a 64-bit pointer, there are 16 bits where you can put whatever you want, and the compiler doesn't care, and the hardware doesn't care; it'll ignore it.
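Here's a sketch of that pointer-tagging idea in code. This is my own illustration of the scheme as described, not HyPer's actual code; the 16 tag bits act as a tiny bloom filter over the keys stored on a bucket chain.

    #include <cstdint>

    constexpr uint64_t kPtrMask = (1ULL << 48) - 1;  // low 48 bits: the address

    inline uint64_t TagBit(uint64_t hash) {
        return 1ULL << (48 + (hash & 0xF));  // map the hash to 1 of 16 tag bits
    }

    // Keep the existing tag bits, install the new chain pointer, and set the
    // bit for the newly inserted key's hash.
    inline uint64_t MakeTaggedPtr(uint64_t old_tagged, const void* next,
                                  uint64_t hash) {
        return (old_tagged & ~kPtrMask) |
               ((uint64_t)next & kPtrMask) | TagBit(hash);
    }

    inline const void* GetPtr(uint64_t tagged) {
        return (const void*)(tagged & kPtrMask);  // strip tags before dereference
    }

    // If the key's bit is not set, the key is definitely not on this chain and
    // the probe can skip walking it; a false positive just costs a wasted walk.
    inline bool MightContain(uint64_t tagged, uint64_t hash) {
        return (tagged & TagBit(hash)) != 0;
    }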
So, anyway, they use those extra 16 bits to hold a bloom filter that tells you whether the key you're looking for is actually going to be on the chain. You can have false positives, because bloom filters are approximate data structures, so you could follow the chain and not actually find the thing you need. But it can save you additional lookups, because when the bloom filter tells you the thing you want is definitely not there, you can skip it. [Student: couldn't you also use the lowest three or four bits?] Yes; if your allocations are memory-aligned, you know you don't need the full 48 bits, and you can make the bloom filter even bigger. We used tricks like that, not for bloom filters, but memory-alignment tricks, in the NoisePage lookup stuff and in Apache Arrow. We can talk about that offline.

All right, linear probing. It's basically the same concept, but instead of having these bucket chains that can be extended, I have one giant array, and the way I resolve collisions is that I scan until I find either an empty space in the array where I can do my insert, or, if I'm doing a search, an empty space that tells me the key I'm looking for isn't there. If I reach the end of the array, I wrap around and start over from the top, and obviously I need to keep track of where I entered the hash table, so that if I loop all the way back around to the same spot, I know the thing I'm looking for isn't there, there are no free slots, and I can break out of what would otherwise be an infinite loop. So it looks like this, with the same keys I want to store. Hash A, land here. Hash B, land there. Hash C: I land where A is, but that slot's already occupied, so I just go down to the next free spot and put C there. Same thing for D, and E goes here. So if I'm doing a lookup on E, I hash it, land here, and do my comparison: E doesn't equal A, so skip to the next one, and I keep going until I find either the key I'm looking for or an empty spot. Same thing for F.

With linear probing you pay the same penalty on both sides, the build side and the probe side, because with collisions, the cost of trying to find something is the same as the cost of trying to insert something. We'll see examples in a second where the extensions to this try to shuffle things around so that lookups are faster than inserts, and depending on what your workload looks like that may seem like the right choice, but in practice it usually doesn't pan out. Plain linear probing is always going to be the fastest because it's so simple and there's no indirection. Well, there is some branching, because you have to decide whether to jump to the next slot or not, but once data is stored in the hash table, you're done; you don't have to go back and move it later on.
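Here's a minimal linear-probing table for the build and probe phases, as my own sketch. It assumes key 0 is unused so it can mark empty slots, and that the table is sized so it never completely fills (a real system guarantees this by allocating, say, twice the build-side cardinality); the Mix function is a stand-in for xxHash.

    #include <cstdint>
    #include <vector>

    struct LinearProbeTable {
        struct Slot { uint64_t key = 0; uint64_t payload = 0; };
        std::vector<Slot> slots;
        uint64_t mask;

        explicit LinearProbeTable(uint64_t capacity_pow2)
            : slots(capacity_pow2), mask(capacity_pow2 - 1) {}

        static uint64_t Mix(uint64_t k) {  // stand-in for a real hash function
            k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
            return k;
        }

        void Insert(uint64_t key, uint64_t payload) {
            uint64_t i = Mix(key) & mask;
            while (slots[i].key != 0) i = (i + 1) & mask;  // skip occupied slots
            slots[i] = {key, payload};
        }

        // Probe: scan from the hash slot; the first empty slot proves the key
        // was never inserted, so the search can stop there.
        bool Probe(uint64_t key, uint64_t& payload) const {
            for (uint64_t i = Mix(key) & mask; slots[i].key != 0;
                 i = (i + 1) & mask) {
                if (slots[i].key == key) {
                    payload = slots[i].payload;
                    return true;
                }
            }
            return false;
        }
    };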
So there was this thought in the literature for a while that said: if you're building a hash table with a certain number of keys and then you're going to probe it a lot, reading the data structure far more than you write to it, then maybe it's worth the overhead of a little extra work when writing to move things around, so that when you read, the data is more likely to be in the position you look at first. Now, if all I want is to avoid these long searches on inserts and lookups, I can just make a really big hash table, which reduces the likelihood of a collision. But there may be other tricks that shuffle things around so that when I do my lookups, I find the thing I want right away.

The first technique is Robin Hood hashing. The idea is to extend linear probe hashing to allow keys to steal slots in the hash table from "rich" keys and give them to "poor" keys, where rich versus poor is the number of hops or steps you are away from where you would have been if there were no collision. If one key that's already in my hash table is really close to where it should be, and the new key I'm trying to insert would end up really far away, I'm better off swapping their positions, so the poor key ends up closer than it otherwise would have been under regular linear probing.

It looks like this. Going back to the example, I insert A here with its value, but now I'm also going to store the number of positions I am from where I should have been if there were no collision. With an empty hash table, A lands exactly where it should, so it's zero jumps away. Same thing with B: it goes up here and it's fine. Now I want to insert C. C wants to go where A is, and at this point C is zero hops from where it should be, but A is also zero hops from where it should be, so C leaves A alone and goes down here, and we update C's counter to say it's one hop away, because it should be up there but it's down here. Now I do an insert with D. D is zero hops away when it wants to go here, but C already occupies that slot, and C is one hop away, which is greater than D's count, so D is not allowed to steal from C, and D just goes down here. Now I insert E. Both E and A are zero hops at A's slot, so A stays where it is; at the next slot, C is one hop and E is one hop, so C stays where it is too; but at the next slot E is now two hops whereas D is one hop, and two is greater than one, so E is allowed to steal D's position. E takes that slot, and now the worker thread has to go figure out where to put D back in, and D ends up landing here. So now, if I do a lot of lookups on E and D, the distance is on average basically the same for both. Then F goes down here.

I think you can guess why this is bad for performance: a bunch of copying. Going back here, I have to find where to put E, decide I'm taking this slot, copy D out into a buffer somewhere, write E in, and then go down and put D back, and that's not free. So even though we're doing this on the build side and not the probe side, the overhead is going to be way too much, and the benefit you get when you actually do the probe is marginal by comparison. It's also unbounded: I could keep swapping things back and forth until everything finally balances out and every key has a position. So in a simple case, sure, this is fine, but on average, not so much.
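Here's what that steal-the-slot rule looks like as code, a minimal sketch in the same toy style as the linear probing table above. The side array of probe distances and the swap-and-continue loop are the heart of the scheme; the sentinel and modulo hash are again just illustrative assumptions:

```cpp
// Minimal sketch of the Robin Hood insert rule on top of linear probing.
#include <cstdint>
#include <utility>
#include <vector>

struct RobinHoodTable {
    static constexpr int64_t EMPTY = -1;    // assumed sentinel: keys are non-negative
    std::vector<int64_t> keys;
    std::vector<int64_t> vals;
    std::vector<uint32_t> dist;             // hops away from the key's "home" slot

    explicit RobinHoodTable(size_t capacity)
        : keys(capacity, EMPTY), vals(capacity), dist(capacity, 0) {}

    size_t hash(int64_t key) const { return static_cast<size_t>(key) % keys.size(); }

    void insert(int64_t key, int64_t val) {
        size_t slot = hash(key);
        uint32_t d = 0;                     // how far the incoming key is from home
        for (size_t n = 0; n < keys.size(); ++n) {
            if (keys[slot] == EMPTY) {      // free slot: just take it
                keys[slot] = key; vals[slot] = val; dist[slot] = d;
                return;
            }
            if (dist[slot] < d) {           // resident is "richer" (closer to home):
                std::swap(keys[slot], key); // steal its slot, then keep probing with
                std::swap(vals[slot], val); // the displaced key instead
                std::swap(dist[slot], d);
            }
            slot = (slot + 1) % keys.size();
            ++d;
        }
        // Table full; a real implementation would resize and rehash here.
    }
};
```

Every swap inside that loop is the copying overhead being criticized here: the displaced key keeps probing and may displace further keys downstream.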
So Hopscotch hashing is a more modern variant: Robin Hood hashing came out around 1985, and this is from 2008. It's a variant of linear probe hashing where, again, we allow keys to steal positions from other keys, but we only do it within what are called neighborhoods. This is a way to bound how far away you have to look, or how far away you have to move something, when you start shuffling things around. The idea is that you want the size of a neighborhood to be a single cache line, or a small number of cache lines, so that when you're deciding where to shuffle things you're not paying penalties to go out to memory to figure out whether it's okay to put something there; you only look at things that are already in your CPU caches. The guarantee is that a key has to be in its neighborhood, or it doesn't exist in the table at all. With Robin Hood hashing I can keep shuffling things around indefinitely, and with regular linear probe hashing I keep scanning until I loop back around to where I started and I'm done; with Hopscotch hashing we bound everything to a neighborhood, meaning that even if I have extra free space elsewhere in the hash table, if a key doesn't fit in its neighborhood then the table is considered full, and I have to stop, blow away the hash table, and rebuild it at double the size. The high-level goal is that, because everything stays in the same cache line, the cost of fetching the neighborhood should essentially be the cost of finding the key within that neighborhood, since you've already paid the cache-line miss to do the lookup. That's the overarching goal.

For this simple explanation, say the neighborhood size is three, and the neighborhoods overlap: this first position starts neighborhood one, then two, three, four, and so forth, and the bottom ones, like neighborhood six, can wrap around and start at the top again. I want to insert A: it's in this neighborhood, and I can insert into any of the slots within the neighborhood and that's considered fine, but in practice you just take the first free one, so A goes there. B wants this neighborhood, so it goes up there. Now I want to insert C. C is in the same neighborhood as A, and A's slot is occupied, so just like linear probing I scan down until I find a free slot and put C there. D wants to go where C is, so same thing: scan down within its neighborhood until I find an empty slot and write it there. Now I want to insert E, but E hashes to where A is, and if I scan through I see that all the positions in its neighborhood are occupied. Again, we have the guarantee that a key either doesn't exist or exists in its neighborhood, so what I need to do now is go back and look at the keys I just scanned through and figure out whether any of them could be shuffled to another position while still staying in their own neighborhood. So I keep track of the slots I just passed through when I couldn't find a free slot, and I check them: D hashes to this neighborhood but is sitting in this position here, so within D's neighborhood it can move down to here and still be in its neighborhood, and that's just fine. Now I can put E in the position D vacated, and then the last one, F, just goes in at the bottom. So it's like Robin Hood hashing in that keys can steal positions, but rather than explicitly tracking the number of positions each key is away from home, it's implicit: based on where a key is located you can determine whether or not it's in its neighborhood.
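Here's a minimal sketch of the Hopscotch insert path under some simplifying assumptions: a neighborhood of H = 4 slots, keys only (no values), a toy modulo hash, and no wrap-around at the end of the array. Real implementations also keep a small per-slot bitmap recording which neighborhood slots hold that slot's keys, which is omitted here:

```cpp
// Minimal sketch of a Hopscotch insert: find a free slot, then "hop" it
// backward until it lies inside the new key's neighborhood.
#include <cstdint>
#include <vector>

struct HopscotchTable {
    static constexpr int64_t EMPTY = -1;   // assumed sentinel: keys are non-negative
    static constexpr size_t H = 4;         // assumed neighborhood size
    std::vector<int64_t> keys;
    explicit HopscotchTable(size_t cap) : keys(cap, EMPTY) {}
    size_t hash(int64_t k) const { return static_cast<size_t>(k) % keys.size(); }

    bool insert(int64_t key) {
        size_t home = hash(key);
        // 1) Linear scan for the first free slot (no wrap-around, for brevity).
        size_t free_slot = home;
        while (free_slot < keys.size() && keys[free_slot] != EMPTY) ++free_slot;
        if (free_slot == keys.size()) return false;  // would resize/rehash in practice
        // 2) While the free slot is outside key's neighborhood, look at the H-1
        //    occupied slots just before it for a resident whose own neighborhood
        //    still covers free_slot, and move that resident into the free slot.
        while (free_slot - home >= H) {
            bool moved = false;
            for (size_t cand = free_slot - (H - 1); cand < free_slot; ++cand) {
                if (free_slot - hash(keys[cand]) < H) {  // stays in its neighborhood
                    keys[free_slot] = keys[cand];
                    keys[cand] = EMPTY;
                    free_slot = cand;
                    moved = true;
                    break;
                }
            }
            if (!moved) return false;  // no legal shuffle: declare full, rebuild bigger
        }
        keys[free_slot] = key;
        return true;
    }
};
```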
All right, quickly, cuckoo hashing. This one I do think shows up in real systems; I think IBM DB2 BLU did cuckoo hashing, though I forget the number of hash tables or hash functions they used. The basic idea is that you can have multiple tables (I'll show the multiple-table version; you could also have a single table with different hash functions), and when I do an insert, I hash the key multiple times to find different candidate positions and take whichever one is free. If all the positions I'm looking at are occupied, I steal one of them: I evict the key that was in there, put my key in, and then move the evicted key to its other location. So you have the same problem you had in Robin Hood hashing and Hopscotch hashing, where you pay a copying penalty to balance things around, but the benefit is that when I do my probes I'm guaranteed an O(1) lookup, because I immediately land where the key should be, or it's not there.

So say I have two hash tables and two hash functions, and I want to insert X. The first hash function maps here, the second maps there; I flip a coin and decide to write X here. Now I insert Y: hash it twice, the first hash function maps to where X is, so that's occupied, and the second hash function maps to this empty position, so I can put Y there. But now I come along with Z, and both hash functions map to locations in the two hash tables that are occupied. So we flip a coin and decide to steal Y's position in the second hash table: Z gets inserted there, overwriting Y, and we pull Y out and put it back in the other hash table. We hash it, it lands where X is, so again we steal that position: X comes out, Y goes in, and now we put X back in the other table, hash it, and finally find a free slot. If we loop back around and recognize that we're back where we started, meaning whatever key we're currently holding hashes to two locations we've already seen and both are occupied, then we know we're in an infinite loop and we have to stop, kill it, and rebuild the hash table at double the size. Again, same trade-off: nice O(1) lookups, but the build phase is expensive because of all the extra copying. Someone asked about Hopscotch hashing: I don't know if anybody actually implements that one; it's a variation on Robin Hood hashing, and I like it.
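A minimal sketch of cuckoo hashing with two tables and two hash functions, in the same toy style as before. The second hash function, the MAX_KICKS cycle-detection bound, and the non-negative-key assumption are all illustrative; a real implementation would detect the eviction cycle and rebuild into a larger table rather than just reporting failure:

```cpp
// Minimal sketch of two-table cuckoo hashing (keys only, for brevity).
#include <cstdint>
#include <utility>
#include <vector>

struct CuckooTable {
    static constexpr int64_t EMPTY = -1;   // assumed sentinel: keys are non-negative
    static constexpr int MAX_KICKS = 32;   // assumed bound standing in for cycle detection
    std::vector<int64_t> t1, t2;
    explicit CuckooTable(size_t cap) : t1(cap, EMPTY), t2(cap, EMPTY) {}
    size_t h1(int64_t k) const { return static_cast<size_t>(k) % t1.size(); }
    size_t h2(int64_t k) const { return static_cast<size_t>(k / 7 + 13) % t2.size(); }  // toy second hash

    bool lookup(int64_t k) const {         // the payoff: always exactly two probes
        return t1[h1(k)] == k || t2[h2(k)] == k;
    }

    bool insert(int64_t key) {
        for (int kick = 0; kick < MAX_KICKS; ++kick) {
            size_t s1 = h1(key);
            if (t1[s1] == EMPTY) { t1[s1] = key; return true; }
            std::swap(key, t1[s1]);        // evict the resident from table 1
            size_t s2 = h2(key);
            if (t2[s2] == EMPTY) { t2[s2] = key; return true; }
            std::swap(key, t2[s2]);        // evict from table 2 and try again
        }
        return false;  // probably a cycle: rebuild with a bigger table in practice
    }
};
```

The constant-time lookup is the whole point: no matter how ugly the insert got, a probe touches at most two slots.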
Okay, so now we have our hash table, and now we have to do the probe. There isn't any magic to this: if we partitioned the data, then each worker scans through its partitions; otherwise you break the input up into morsels or whatever chunks you want in order to scan through it. You use the same hash function from the build and partitioning phases, do a lookup in the hash table we built during the build phase, check whether we have a match, and if so, combine the tuples together and produce the output.

The one optimization you can do comes from Vectorwise. When you do the build phase and construct your hash table, you also build an auxiliary Bloom filter on the side that tells you whether the key you're looking for is even in the hash table. The idea is that the cost of a lookup in the Bloom filter is so cheap relative to a hash table lookup that checking the filter first is basically always worth it. So I want to join A and B: A is the build side, so it builds the hash table, but it also builds this Bloom filter, which gets passed over to B. In the probe phase, B checks the Bloom filter first; if the key it's looking for is found in the Bloom filter, then it does the actual lookup in the hash table; if the Bloom filter says the key is not there, then it's definitely not there, because Bloom filters can't have false negatives, so you don't even bother doing the hash table lookup. The Vectorwise folks report about a 2x performance speedup for really selective joins: if most of the probe keys are not going to be found in the hash table, the Bloom filter crushes it. It's like doing a predicate pushdown, except the information you're generating to tell you whether a tuple could even match comes from the other side, the build side. From a strict relational-model standpoint, where query plans are trees of operators feeding into each other, B is not really supposed to know anything about A, which is why this is called sideways information passing.
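Here's a minimal sketch of that probe loop, reusing the LinearProbeTable sketch from earlier. The single-hash bitmap below is a toy stand-in for a real Bloom filter (which would use several hash functions and tuned sizing), and the multiplicative mixing constant is just an illustrative choice, not the Vectorwise design:

```cpp
// Minimal sketch of a probe-side Bloom filter check: consult the cheap
// filter first and only touch the hash table when the key might be there.
#include <cstdint>
#include <vector>

struct BloomFilter {
    std::vector<uint64_t> bits;
    explicit BloomFilter(size_t nbits) : bits((nbits + 63) / 64, 0) {}
    size_t pos(int64_t k) const {
        uint64_t h = static_cast<uint64_t>(k) * 0x9E3779B97F4A7C15ULL;  // cheap mixer
        return h % (bits.size() * 64);
    }
    void add(int64_t k) { bits[pos(k) / 64] |= 1ULL << (pos(k) % 64); }
    bool mayContain(int64_t k) const { return bits[pos(k) / 64] >> (pos(k) % 64) & 1; }
};

// Probe loop: a "definitely not there" answer skips the expensive lookup.
// HashTable is assumed to expose probe(key) -> std::optional<int64_t>,
// like the LinearProbeTable sketch above.
template <typename HashTable, typename Emit>
void probe(const std::vector<int64_t>& probe_keys, const BloomFilter& bf,
           const HashTable& ht, Emit emit) {
    for (int64_t k : probe_keys) {
        if (!bf.mayContain(k)) continue;   // no false negatives, so safe to skip
        if (auto match = ht.probe(k)) emit(k, *match);
    }
}
```

The more selective the join, the more probe keys fall into the `continue` branch, which is where the reported 2x comes from.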
Okay, so to finish up. The German paper you read, even though we've covered all these variants, only looks at chained hashing, linear probe hashing, and something called a concise hash table, which came from IBM. We don't need to cover that one; I don't think anybody other than IBM actually uses it; it's basically a giant array with Bloom filters in front of it. They put all of this into a single test bed. They have what they deem the unoptimized, or "white box," versions of the implementations, based on their reading of the papers that describe the algorithms, and then they have what they call their "black box" implementations, where they actually went through, understood the bottlenecks, and optimized them even further. The core approaches they compare are the no-partitioning hash join, the concise hash table from IBM, and the two-pass radix hash join using either a chained hash table or a linear-probe hash table. Then they have special-case variants of these that use a plain array instead of a hash table: if you have monotonically increasing primary key integers, one, two, three, four, and so on, you know exactly where to jump in the array to find the key. You can't always do that; you aren't always going to have keys that look like this, but it's nice to see it as the best you could possibly do, even though nobody actually does this.

I'm just going to show one graph. You can't really see the division over here, but these are all the optimized ones, the black-box implementations, where they went through and tuned everything further, and these are the ones you'd get if you just read the papers and implemented what they say. The sort-merge join is what we'll cover next class; the paper talks about it a little, and the paper to read after spring break covers exactly what that approach is. And these are the abbreviations, which I don't like the way they're used in the paper, it's kind of confusing, so here's the mapping. So what's the main takeaway? The no-partitioning one does better than all of them once you account for effort. All these extra things are additional optimizations you can do, like the streaming writes, the combined output buffer, and the localized buffers, and if you do all of them, tuned exactly for what the hardware expects, you get almost 2x better performance. But for the no-partitioning join with linear probe hashing, you don't do anything: you implement it once, it runs on whatever hardware you have, and it does not quite as well, but pretty well. This is why most systems are going to choose it.

Then they have this other graph that asks: how much time are you actually spending doing joins in a real system, relative to everything else you have to do, like producing output buffers or scanning the table? What they show is that you're really spending at most 20% of the time doing hash joins for an in-memory data set with an engine like this. So the argument they make is that all the extra work these different algorithms do to hash really efficiently maybe doesn't matter that much, and there are other things in the system you should be optimizing. I would argue the query optimizer is probably the thing that has the most impact on performance, because if you have a crappy join ordering for your queries, it doesn't matter whether you're using an optimized hash join; it's still going to be super slow.

Just to show you alternative numbers, this is profiling data that was sent to us by Ippokratis, who will give a talk later in the semester; he used to work on Cloudera Impala. This is from profiling data they sent us for running TPC-H on their cluster, showing how much time their system was spending doing hash joins. I don't know whether this also includes network traffic, and I don't have these numbers anymore, but they argue you're spending almost 50% of the time doing hash joins, and therefore you should worry about making them as efficient as possible. These are dated numbers, and I'm more inclined to believe the German numbers, but I wanted to show them so you know you may see other numbers out there. Okay.
All right, so the main takeaway from this class is a bit like a teacher showing you a bunch of stuff and then saying don't do it. It's like D.A.R.E. when I was growing up: they showed you how to do drugs and then said don't do it, and I wouldn't have known about those drugs unless they told me about them. So here's a bunch of stuff, don't do it, which may make you go do it, I don't know. The partitioned hash join will outperform the no-partitioning approach in a bunch of use cases, but it's not trivial to implement, and it's just not worth the extra engineering effort to get that last 10% of performance, because there are other things you should be worried about in your system. And as I said, most database system vendors are going to pick one hash join implementation and one hash function, and be done with it. Don't try to be clever like, "oh, my data looks like this, therefore I should use radix partitioning this way versus no partitioning that way." Everybody just does one thing, and that's it.