Transcript for:
Under the Hood: Hybrid Tables

Welcome to Under the Hood: Hybrid Tables. As the name says, this talk is going to be full of detail, so it will skew toward the geeky side of the audience. My name is Christian; I'm an engineer at Snowflake, and my area of responsibility is the architecture of Unistore and FDB. I have with me David, a senior engineer in performance from our Berlin office, and between the two of us we'll hopefully give you a good view of how Unistore was built and the architectural reasons it was built the way it was, so you can reason and discuss with us the implications of those architectural choices.

We can't start this discussion without going a little bit through the business case for Unistore in general. I'm sure you've seen this slide in other presentations; it simply describes the current state of the relational database world, where you end up with a split-brain problem. For historical reasons, OLTP databases are completely different engines, completely different stacks, than OLAP analytical engines, and all sorts of side effects come from that: security, impedance mismatches, data transfers from one store to another, cost, latency distortions, governance. These are side effects that database vendors unfortunately push onto our customers, and our answer, or our attempt, is to create a single store that does a good job with both workloads. That, in brief, is the motivation for Unistore; I'm not going to come back to the business case.

To drill down into the underlying structure of Unistore and hybrid tables, we start with the traditional Snowflake architecture. It's a diagram I'm sure you've seen before and are very familiar with: a three-layer architecture where cloud services orchestrate the database. You can think of that layer as the brain of the operation, and the giveaway is the metadata manager; the fact that metadata is stored there means it's the only tier that really understands what's going on globally. But more things live in that tier: security, connectivity, optimization, compilation. Then you have the query processing tier, which is the workforce, the virtual warehouses doing the actual query execution. Finally, the storage tier distributes and maintains the data artifacts, whether that's the raw data, results, partial results, and so on. Again, I imagine everybody in this room is very familiar with this.

Now we're going to zoom into the bottom layer, and this is how Unistore looks in our current documentation: the bottom layer gets split in two, and the orange box on the left is hybrid tables. We're going to drill down into that. Here's the same layer, and what I've done in this slide is strip out all of the software elements; you only see hardware. The reason for this diagram is to show that while traditional Snowflake tables rely on blob storage, whether that's S3 buckets or blobs in Azure, as their fundamental storage medium, the hardware is fundamentally different in Unistore: hybrid tables rely on direct-attached storage. The reason for this is very simple: while blobs are very good for throughput, for large scans, they're really not the right medium for low latency. Latency is the reason for that choice.
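As a purely illustrative aside (not from the talk): you can get a feel for that gap yourself by timing a small read from a local, direct-attached disk against a small ranged GET from object storage. The file path, bucket, and key below are invented, and the sketch assumes boto3 with working AWS credentials.

```python
# Hypothetical illustration: latency of a 4 KiB read from a local (ideally
# NVMe-backed) file vs. a 4 KiB ranged GET from blob storage. Paths, bucket and
# key are invented; a real comparison would warm up and repeat many times.
import time
import boto3

def time_local_read(path, size=4096):
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read(size)
    return (time.perf_counter() - start) * 1000  # milliseconds

def time_blob_read(bucket, key, size=4096):
    s3 = boto3.client("s3")                      # client setup outside the timer
    start = time.perf_counter()
    s3.get_object(Bucket=bucket, Key=key, Range=f"bytes=0-{size - 1}")["Body"].read()
    return (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    print(f"local read: {time_local_read('/mnt/nvme/some_file.dat'):.3f} ms")
    print(f"blob read : {time_blob_read('my-bucket', 'some_file.dat'):.3f} ms")
```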
Generally there's a one-to-two orders of magnitude difference between these two storage types. Even with managed disks today, across all three major cloud providers, you can get sub-millisecond response times, and local disks are even faster, whereas your P99 is likely in the hundreds of milliseconds when you go to blob storage, which makes blobs fundamentally not fit for the job of doing HTAP or OLTP-like workloads. So the hardware is different; that is the first point.

Now, on top of the hardware we're adding the first layer, and the first layer in the case of hybrid tables is FoundationDB. FoundationDB, or FDB as we call it, sits at the bottom of our stack. What is FDB? It's an open-source, well-documented, cluster-hosted key-value store. The fact that it's well documented is important: you can go online right now, see the documentation (it's quite good), and check in real time that what I say is accurate. My colleagues will do that anyway. I'm going to go through all of these properties so you have an understanding of where we came from when we made that choice, but primarily, we had used FDB in various roles internally at Snowflake before choosing it as the platform of choice, the underlying storage technology, for hybrid tables. So I'm going to discuss first the native properties that we took from FDB, and then I'm going to talk about the additions: how we changed the FDB platform to make it fit to be an HTAP storage subsystem.

One of the main reasons we picked FDB as the underlying technology is that it is a reliable data store. By the time we started building Unistore we had a good history, like I said, of using FDB in-house, but there were also other tier-one data companies using FDB in large deployments, and consistently the story that came out of those real-life use cases was one of good reliability. I have a little anecdote: FDB was released in the 2012-2013 time frame, so it's more than a decade-old technology, but its claim to fame when it came to market was testability. The creators of this open-source database understood that building databases is a complex job, and building distributed systems is a complex job, so when you put the two together you end up with exponential complexity. FDB's claim to fame is its verification framework, its validation framework: essentially, you can run the entire cluster on a single machine. FDB has its own language in which it is developed, a language derived from C++, and the entire purpose of this infrastructure is that when you run in simulation you can verify all of the code paths, including all of the error paths. That gives you not just excellent code coverage but excellent functional coverage, to the point where, when I onboarded at Snowflake and on FDB, multiple senior engineers told me how we have not seen corruption in production thanks to this technology. For me as a database engineer that was very satisfying, because it's very rare to see this arc from idea to realization so explicit and quite so successful. So reliability was one of the big reasons we went this route.

Right next to reliability, availability has to be called out as one of the reasons we selected it. Again, it wasn't a cloud-native database, but it was a cluster-native database, meaning that availability, with distribution across multiple replicas, was built in; it was part of the package.
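Since the talk leans on FDB being a well-documented, ordered, transactional key-value store, here is roughly what the open-source client API looks like through the official Python binding. This is a sketch of the public FDB API, not Snowflake's internal layer; it assumes a running cluster reachable via the default cluster file, and the "account" key layout is invented.

```python
# Sketch of the open-source FoundationDB Python client API (not Snowflake code).
# Keys live in one sorted keyspace; reads and writes happen inside transactions.
import struct
import fdb

fdb.api_version(710)   # match this to the installed client version
db = fdb.open()        # assumes a cluster reachable via the default cluster file

@fdb.transactional
def transfer(tr, src, dst, amount):
    # Two keys are modified in a single transaction and commit atomically;
    # the decorator retries the whole function on conflicts.
    tr.add(fdb.tuple.pack(("account", src)), struct.pack("<q", -amount))
    tr.add(fdb.tuple.pack(("account", dst)), struct.pack("<q", amount))

@fdb.transactional
def balances(tr):
    # Because keys are sorted, a prefix range scan is a simple, cheap read.
    rng = fdb.tuple.range(("account",))
    return {fdb.tuple.unpack(k)[-1]: struct.unpack("<q", v)[0]
            for k, v in tr.get_range(rng.start, rng.stop)}

transfer(db, "alice", "bob", 10)
print(balances(db))
```

The retrying, multi-key transaction shown here is the "micro-transaction" building block Christian comes back to in a moment.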
In our context, being a cloud database means that we put at least three replicas across at least three AZs, and in doing so we arrive at the same level of reliability and availability that the other cloud vendors, or really any cloud-scale storage, currently provide, meaning that you are isolated from any single AZ going down.

FDB is an efficient database out of the box, so we got a good performance baseline from all of these native properties. However, we have pushed the product beyond where we inherited it. We took that performance baseline and improved steady-state performance by changing the storage engine completely, for a storage engine with better memory utilization and better I/O behavior. That was one major change. The other element we looked at is the speed of recovery: when individual storage nodes go down, how quickly can they bounce back and recover? From our perspective, if you have tens of thousands of nodes under management, something is going down at any one point, and for operational agility you have to block for as little time as possible, so fast recovery was quite important. Also speaking of operational agility, FDB allows us to roll out updates without taking down the system, because it has robust support for online updates. This matters when you have a large fleet of storage nodes: there is no single bit you can flip to change the version from one to another, and we roll out versions quite frequently, so a system that doesn't understand heterogeneous versions within a cluster is simply not tenable. FDB had that understanding of the need to support multiple versions.

Finally, keep in mind the core property: ultimately this key-value store keeps the keys in sorted order, so it's based on a sorted collection, a B-tree at scale, and it has support for micro-transactions. These are not the transactions you're used to from the SQL world, because they're quite limited in how big they can be and how much data they can touch, but even so, in FDB you are able to touch multiple rows at the same time and modify them atomically, and we have used this capability to build our transaction manager on top of these micro-transactions and provide the SQL transactions you expect out of hybrid tables.

But for all these native benefits, so to speak, we had to modify FDB significantly, with new capabilities, to make it suitable for HTAP workloads. Here I've changed the picture; I'll go back for just one second so you can see the transition: those three disks, which were the FDB cluster before, I'm changing to this stacked sandbox picture. The reason for most of our changes, for transforming FDB beyond where it was as an open-source project, is the need for multi-tenancy. Multi-tenancy requires us to sandbox individual users, such that activity from one user does not end up affecting other users in the system, and we do this by enforcing usage quotas. To form a mental model of usage quotas, imagine adding together all of the resources of the system: the cumulative IOPS across all the machines in a cluster, all the memory, all the CPU, all the network transfers you could possibly do. That is your budget. Now you slice it, and to every tenant you give a subset of those slices.
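One way to picture those per-tenant slices is as a token bucket per tenant and per resource dimension. The sketch below is only a mental-model illustration; the resource names, rates, and admission function are invented and this is not Snowflake's actual admission-control code.

```python
# Illustrative mental model only: one token bucket per (tenant, resource),
# refilled at that tenant's steady-state slice of the cluster-wide budget.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec          # the tenant's slice of the budget
        self.capacity = burst             # allows short bursts above the rate
        self.tokens = burst
        self.last = time.monotonic()

    def try_consume(self, amount):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False                      # over quota: throttle or queue

# One bucket per tenant and per dimension (IOPS, CPU time, bytes out, ...).
quotas = {
    ("tenant_a", "iops"): TokenBucket(rate_per_sec=5_000, burst=20_000),
    ("tenant_b", "iops"): TokenBucket(rate_per_sec=1_000, burst=4_000),
}

def admit(tenant, resource, cost):
    return quotas[(tenant, resource)].try_consume(cost)

print(admit("tenant_a", "iops", 100))     # True while within the slice
```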
This is an oversimplified image of it, but then you prevent that tenant from executing outside of this envelope: you put in the mechanisms that keep them boxed in, so to speak. Now, this is again very simplified, because it does not take into account the fact that you also have local resources to protect from over-utilization: if a single node is overused, even below quota, you still have to react to that. But it's a good mental model. That's what the sandbox does. Moreover, the sandbox has to prevent any possibility of accidental data leakage between tenants, so we had to harden the data, both in transit and at rest.

Then we added this continuous backup function out of FDB. You can see the blob storage I drew below the orange box, with the two arrows; the outgoing arrow, at least, represents the continuous backup. What is this continuous backup? Basically, as data changes in FDB, we take the old data that has been modified and move it out to blob storage. It has a symmetry with what you see in traditional Snowflake tables, and what it effectively does is create a historical record. We then use this for functionality such as clone-database-as-of or as-of querying, when you go back in time and want to see the data as it was at a particular point. The mental model here is that the FDB cluster holds the fresh data while the blobs hold the historical data, and this is quite a significant change from the way FDB operates in the open-source version.

The layers above FDB use the same representation of the data that we use for backup, for things I already mentioned like historical as-of querying and clone-as-of, but also for adaptive querying: if you have a query that's a point lookup on current data, it goes to the key-value store, to the hardware attached to NVMe drives, whereas if you have a large scan, it uses the same logic for which we use blobs in traditional Snowflake tables; it goes after the backup format and extracts the answer from that type of storage. So adaptive querying is derived from the same information we use for continuous backup.

Moreover, we use the same capability for native bulk load. Here's one way to think about native bulk load: OLTP operational databases have trouble with bulk operations. Why? Because those engines are built for small, very discrete, fast transactions. Think about the mechanisms: a log manager that is very discrete, with one small record for every modification in the database; a lock manager with one lock for every row you touch. What happens if I come with a bulk load of hundreds of millions of rows? You have to take locks on all those hundreds of millions of rows. Basically, bringing data in at the top of the funnel is a big problem for transactional databases. Because we were trying to build an HTAP system where ETL is a fact of life, we built bulk load natively by using the same capability we had from backup and restore and from adaptive querying, in effect inverting the direction of backup: if you bring files in the same format to the bottom of the stack, we implemented loading them directly into the engine, and that operation can be done without the discrete micro-operations OLTP usually requires; it's done in parallel and it's quite fast.
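To tie the adaptive-querying idea back to that shared backup representation, here is a purely conceptual routing sketch. The threshold, the estimate object, and the return strings are all invented for illustration; they are not Snowflake internals.

```python
# Conceptual sketch only: route a request either to the row-oriented key-value
# store (fresh data, low-latency point reads) or to the backup representation
# on blob storage (large scans and historical "as of" reads).
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccessEstimate:
    rows_estimated: int                     # optimizer's cardinality estimate
    as_of_timestamp: Optional[str] = None   # set when reading historical data

POINT_LOOKUP_THRESHOLD = 10_000             # invented cutoff

def choose_storage(est: AccessEstimate) -> str:
    if est.as_of_timestamp is not None:
        # Old versions live in the continuously backed-up snapshots on blob
        # storage, not in the live key-value store.
        return "blob storage (backup representation)"
    if est.rows_estimated <= POINT_LOOKUP_THRESHOLD:
        return "key-value store (direct-attached NVMe)"
    return "blob storage (backup representation)"

print(choose_storage(AccessEstimate(rows_estimated=3)))
print(choose_storage(AccessEstimate(rows_estimated=50_000_000)))
```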
The other space where we had to change quite a few things is the control plane, because offering a database as a service means our clusters need to be a lot more elastic than if you were managing a database internally for internal workloads; you end up with very unpredictable, dynamic capacity fluctuations in a public cloud. So we had to improve, to facelift, the control plane such that we can grow and shrink clusters, move tenants, and rebalance utilization at runtime, and we have to do all of this without end-user downtime. All of these are new capabilities.

And finally, we've invested a significant amount of time in what we call quality of service. What does quality of service mean in our context? It means the effort to reduce the variance between the best client interaction and the worst client interaction. One example would be your experience when you commit a transaction: when we started this project you would see our P99, which is to say one in a hundred transactions, be 10 to 30 times slower than the average transaction, so the variance was quite wide. The entire quality-of-service effort, across multiple dimensions of which commit is just one example, is to reduce that variance, and we can see that effort paying off: right now we're at about a three-to-five-times variation, so we've shrunk it quite a bit. This is a continuous process; it doesn't end here. We will continue, so that quality of service is perceived as predictable throughout the entire utilization. This work is a lot of performance characterization and CPU-efficiency work; it's quite a bit of detail work across the entire surface of the product. And now I'm going to invite David to walk you up the stack. FDB is almost behind us.

Yeah, thanks Christian. So Christian has talked about how we transformed FDB to become this powerful basis that powers hybrid tables here at the bottom, but let's take a step back and look at what hybrid tables actually are and what you can use them for. You can use hybrid tables for fantasy football; I've heard that's quite popular here. We have something similar in Germany, but with soccer, of course. But really, you could take any app with a large user base, where a lot of users are concurrently interacting with your data through operations or transactions, so queries that are quite simple but require very low latencies. For such an app you usually need to build a backend, a backend with an OLTP database, a database that supports this really highly concurrent access with low latencies. You'll likely have to run a couple of servers to catch any usage spikes, for example right before game day starts, when users are doing some last-minute transactions, buying and selling athletes. This all comes with a cost, both for building the system and for keeping it compliant: the uptime needs to be appropriate, backups, maintenance; all of these costs have to be paid for this extra backend. And if you want to run analytical queries against this data, and you most likely do, because you want to build the most user-friendly app there is, you have to extract your data from this transactional database over an ETL (extract, transform, load) pipeline into your analytical database.
There you can run analytical queries, answering questions like: what was the most traded player in the last 24 hours? Or: which section of my web app is maybe not really findable by users; is there a button that isn't identifiable as such; how can I improve the usability of my app even further and give the maximum value to my users? Now, with hybrid tables you don't need the separate backend. You can use hybrid tables as your backend, since they support this really highly concurrent access and these low, double-digit-millisecond latencies.

Now that we have seen a very good use case for hybrid tables, let's take a closer look at what hybrid tables actually are. Here I've taken a slide from our customer-facing deck about Unistore: Unistore, performant and easy to use. I really like this slide, but I think it can be simplified even further, just by adding the single keyword here, HYBRID, to your CREATE TABLE statement. What you get is seamless integration with the rest of Snowflake: these hybrid tables are not second-class citizens; they work and interact with the rest of Snowflake just as you would expect. And on top of that you get really high performance: orders of magnitude more throughput in both reads and writes, while still getting these really low latencies for your transactional workloads.
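As a concrete sketch of what that single keyword looks like from the application side, here is a hypothetical example through the Snowflake Python connector. The connection parameters, table, and columns are invented, and the exact hybrid-table syntax and supported features should be checked against the current documentation.

```python
# Sketch only: creating and using a hybrid table via the Snowflake Python
# connector. Connection parameters and the schema are invented.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# The one-word difference: HYBRID. Hybrid tables are row-oriented underneath
# and require a primary key, which also serves fast point lookups.
cur.execute("""
    CREATE HYBRID TABLE player_trades (
        trade_id   BIGINT PRIMARY KEY,
        player_id  BIGINT NOT NULL,
        user_id    BIGINT NOT NULL,
        price      NUMBER(10, 2),
        traded_at  TIMESTAMP_NTZ
    )
""")

# Transactional, single-row operations at low latency ...
cur.execute("INSERT INTO player_trades VALUES (1, 42, 7, 9.99, CURRENT_TIMESTAMP())")
cur.execute("SELECT price FROM player_trades WHERE trade_id = 1")

# ... and analytical queries over the same table, with no ETL in between.
cur.execute("""
    SELECT player_id, COUNT(*) AS trades
    FROM player_trades
    WHERE traded_at > DATEADD('hour', -24, CURRENT_TIMESTAMP())
    GROUP BY player_id
    ORDER BY trades DESC
    LIMIT 1
""")
print(cur.fetchone())
```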
Now, what did we have to do to make this happen? What's the magic behind this keyword? Let's take a closer look under the hood. We have optimized the database storage layer, but we also had to look at the remaining two layers, to make sure that queries arriving at the top, at the cloud services layer, go through the stack, through the query processing layer, down to storage and back out again, as efficiently as possible. In the beginning this wasn't the case, so we had to put these other layers on a serious diet. We assigned latency budgets to the individual components, just a very few milliseconds per component, to arrive at these really low latencies overall. We ran benchmarks, looked at CPU profiles to see where our CPU cycles were being spent, and we found our first culprit: per-query telemetry.

Per-query telemetry is incredibly useful to us when it comes to addressing customer inquiries. A very common customer ticket is: how can I make my query run even faster, how can I optimize my query? In those cases it's incredibly useful to be able to look at the query at a really fine-grained level, to see at what timestamp the query did what and how long it took, and this enables us to give very timely feedback, perform the analysis, and give guidance and mitigation for any customer inquiry. Now, this does come at a cost. For one, we pay computational costs: we need to compute each log line, write it to disk, ship it off over the network. But also storage costs: we need to store these log lines for later retrieval during our analysis. This cost-value balance is really fine-tuned for the analytical workloads that make up most of the workloads today. Now, what happens when we look at transactional workloads? Here we have to understand the change in scale that is coming: orders of magnitude more throughput hitting the system, and every one of these transactional queries is actually quite similar, so we would be writing the same log lines over and over again. The storage costs alone for these log lines would already breach the budget we have per Unistore query.

So the solution is easy, right? Just skip writing the log lines. Not that easy, because that would interfere with our seamless integration: the Snowflake feature that would break with this approach is the query details page. If you didn't write the telemetry per query, the user wouldn't see anything there. So we solved this problem by asynchronously exporting the telemetry and getting it to the user, maybe delayed by a few seconds, but in a way that doesn't impact the per-query latency as much. We also went through all the log lines in the system, really pruned them, and looked at which log lines could be combined, so actually all queries across Snowflake benefit from this change, not just the ones that run on hybrid tables.

Another example, and here we actually get to the FDB metadata store: these were the large-scale FDB deployments Christian talked about, which Snowflake has been running for longer than hybrid tables have existed. The FDB metadata store is separate from the FDB data store; the FDB data store powers hybrid tables, while the FDB metadata store stores all the metadata for all our tables. When a query comes into the system, we fetch the user object from that store and check what roles the user has, and whether those roles actually allow the user to access the data they're trying to access. Each of these queries to the metadata store goes over the network to that FDB deployment, so each one incurs on the order of milliseconds of latency. For analytical workloads that's a drop in the bucket; those workloads run for multiple seconds, minutes, sometimes even hours, so a few milliseconds is not a big deal. But for transactional queries this was prohibitively expensive: we cannot afford to lose those milliseconds. And even if you ignore the milliseconds, the change in scale alone would become a problem: if you're suddenly seeing thousands or tens of thousands of queries per second hitting this metadata store, we would have to scale it up significantly to handle the traffic, which would again increase the cost of hybrid tables. The solution here was to implement various local caches in the areas where we used to access the FDB metadata store. These caches infrequently refresh their data from the FDB deployment, but the thousands and tens of thousands of queries per second that hit the cache, and usually query the same object over and over again because the transactions tend to hit the same tables, all go to the in-memory cache now. So we're talking on the order of nanoseconds now, not milliseconds. And again, all of the queries running on Snowflake benefit from this change, not just hybrid tables, but it was essential to have this implemented for hybrid tables to reach these low latencies.
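The caching idea can be pictured with a small hand-rolled time-to-live cache. This is an illustration only; the TTL, key shape, and fetch function are invented, and Snowflake's real caches also have to deal with invalidation and consistency concerns that a sketch like this ignores.

```python
# Illustrative only: an in-process cache with a time-to-live, standing in for
# the local metadata caches described above.
import threading
import time

class TTLCache:
    def __init__(self, fetch_fn, ttl_seconds=30.0):
        self._fetch = fetch_fn            # slow path: network round trip (~ms)
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries = {}                # key -> (value, expires_at)

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            hit = self._entries.get(key)
            if hit and hit[1] > now:
                return hit[0]             # fast path: in memory, ~ns not ms
        value = self._fetch(key)          # occasional refresh from the remote
        with self._lock:                  # store (concurrent misses may fetch
            self._entries[key] = (value, now + self._ttl)  # twice; fine here)
        return value

def fetch_roles_from_metadata_store(user):
    time.sleep(0.002)                     # stand-in for a ~2 ms network call
    return {"roles": ["ANALYST"]}

role_cache = TTLCache(fetch_roles_from_metadata_store)
for _ in range(10_000):
    role_cache.get("alice")               # only the first call pays the round trip
```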
Now, we've talked a lot about optimizations in the cloud services layer, the brain of Snowflake. Let's take a look at the execution platform layer, also called the muscle of Snowflake, with the last example of how we optimized it. One assumption that actually holds for analytical queries is that most of the work, 95 to 99 percent of it, happens on the execution platform layer. These are the warehouses our customers pay for: they download huge files, highly parallelized, scan them, compute the results or joins, and then return that to the user. So for analytical workloads, which are quite complex, most of the work is done here. One design decision made early on was to isolate these queries, to ensure isolation, by having each query run in its own process: you would spin up a process, execute the query (fetch the query plan, do the work), and then return the results to the user. Now, what happens if you're suddenly getting orders of magnitude more queries through this warehouse, but every query is also doing much less work? Instead of scanning huge files, they're now just updating or fetching a key in a key-value store. The symptom we saw in our experiments was incredibly elevated system CPU utilization. For context, system CPU should generally be in the low single-digit percent of your overall CPU utilization, because system CPU means you're spending CPU cycles managing system resources: memory, processes, threads. Every time your program goes to the operating system to request a resource, you're spending cycles in system CPU.

The mitigation for this issue is already in the title: we need to reuse these processes. What this picture doesn't convey is that this change would affect a very broad area of the code base, because it changes a really basic underlying assumption, namely that every process only runs one query. Under that assumption there was no need to clean up or reset any state after executing a query, because the process would be relinquished to the system. But we were able to revamp our process model to allow executing more than one query in a process, while still ensuring isolation. So we optimized the muscle of Snowflake to not just lift these heavy workloads, but also to lift a lot of really tiny workloads really efficiently.

These three examples I just showed were maybe responsible for the really big chunks we took out of our latency, but I haven't mentioned all the other smaller projects that maybe shaved off a percent of CPU utilization here or a millisecond there. In total we were able to improve our latency by over a factor of 10 and improve our throughput by several orders of magnitude.

But the performance improvements in the code were not the only changes we had to make in the organization. My team was actually mostly involved in developing new performance-testing tools. That was incredibly important, because we used to measure the performance of analytical workloads, really big, complex queries, and try to optimize them: making a 10-minute query take five minutes or less would be very valuable to our customers, and our regression tests would check that we don't regress that performance. But now, with hybrid tables, we have to look at how the system performs under highly concurrent workloads, thousands or tens of thousands of queries running at the same time, per second. We had to not only build these tools but also implement automation processes around them and educate engineers on how to work with them. The second part, which I was already alluding to in the examples I showed, is observability. During the query telemetry example I already said that we are dealing with a different type of workload here, and we need an observability model that is compatible with these hybrid workloads. We need to be very mindful of which metrics we show and aggregate for the user, and at what frequency and what granularity it makes sense to aggregate these metrics, depending on whether we are looking at an analytical or a hybrid workload.
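For flavor, here is a toy version of that kind of highly concurrent measurement: fire thousands of small requests at once and aggregate their latencies into percentiles, rather than timing one long query. This is not Snowflake's internal tooling; the simulated query is just a stand-in.

```python
# Toy load generator: many concurrent "point queries", aggregated as percentiles.
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_point_query(_):
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.010))     # stand-in for a real point query
    return (time.perf_counter() - start) * 1000  # milliseconds

with ThreadPoolExecutor(max_workers=200) as pool:
    latencies_ms = list(pool.map(run_point_query, range(10_000)))

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
print(f"p50={cuts[49]:.1f} ms  p99={cuts[98]:.1f} ms  max={max(latencies_ms):.1f} ms")
```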
And lastly, we had to stress-test our production systems. Before we hand over this really powerful tool that is hybrid tables, we need to make sure that our systems, which are already running our customers' workloads, are up to the task. Even if you're 99 percent sure that nothing will go wrong, and nothing did go wrong in the end, it's still a pretty sweaty-palms-inducing activity, even for the most seasoned engineers, to push your production system to limits that have never been reached before.

Now we're almost at the end of the talk. If there's one thing I want you to take away, it's pretty much on this slide: Unistore, performant and easy to use. By adding this one keyword to a CREATE TABLE statement you get seamless integration with the rest of Snowflake and really high performance for transactional workloads; you can use hybrid tables as the backend for your apps.

Let's take a final look into the near future of Unistore. GA is planned for this year on AWS, and one of our main concerns, one of the areas we work on, is price-performance: we need to improve price-performance to stay competitive in the market. Performance and cost reductions are almost the same thing; they usually map to each other. If we can improve the performance of the product by pushing significantly more throughput through the same hardware, we can offer cost reductions to our customers while also offering them an upgrade if they need more throughput, so it's a win-win situation. Most of the research and development on Unistore today is in the area of improving performance, and with it, cost.

And finally, this could be seen as a call to action: we are optimizing the customer workloads that we see today in the real world. Our engineers are looking at the workloads our customers run on hybrid tables and the problems they may be encountering, and they're optimizing the system for those workloads, now that we are already in public preview. So if you're already running on hybrid tables, or plan to in the near future, a Snowflake engineer may already be optimizing your workload tomorrow. I want to thank you for your attention.