Hi guys. You already know quite a lot about dedicated SQL pool, this good old traditional data warehouse that is available as part of Synapse Analytics. Today I would like to cover some additional features of dedicated SQL pool, like workload management and partitioning, so let's get started. All right, so let's talk about those extra features that make SQL pool so nice. The first thing I want to talk about is workload management, so let's jump into the whiteboard and let me start describing what this feature is. Workload management, that's the first thing I want to discuss today. So imagine this: in the middle we have our wonderful SQL pool, the one that we designed, the one that contains the tables, because it's all about tables, probably modeled using a star schema, or a snowflake schema if for some reason we decided that a snowflake would be better. Now, how do we handle concurrent queries against this database? And why concurrent? Think of it: on one side, let me change the color to green, we will have our internal queries, for example the ones we use to populate our tables with data, those INSERT and UPDATE statements. At the same time we might have users that want to run ad hoc queries, some analysis, because that's what a SQL pool is used for, to provide data for analysis purposes. We probably also have some front-end tool that presents the data in a nicer form than just the result of a SELECT query, so we might have Power BI or some other reporting tool; let's assume it is Power BI, and Power BI will also try to connect to our pool to get the data. So as you can see, we might have multiple principals, multiple entities, that want to access our data at the same time, and we will have concurrency. How do we handle it? How do we make sure that each of those requests is handled in a proper way, that a proper amount of resources like memory or CPU is available to every request, and finally, how do we make sure that the queries we consider more important are handled first? The answer is: that's exactly what workload management is for, to handle this concurrency. So how does it work? Let me move this to the side. Let's say we've got someone, whether it's a process, a person or some piece of software, that would like to access our data, insert, update or just select, it doesn't really matter; basically that principal wants to query the data. We know that the query is sent to the dedicated SQL pool, to our control node, but then what? How is it handled? Let's assume that this big rectangle is our SQL pool. What happens under the hood is that the query gets classified. The first thing that happens whenever a query is submitted is classification, and classification simply means that our query, our request, is mapped, based on some rules that I will talk about in a second, to a workload group, basically to a set of resources that will be used to execute the query. All right, but workload groups, what are those? Let me add them to our diagram, those red rectangles. We might have multiple workload groups, and a workload group is simply a set of resources that we define as available to a specific set of queries. So basically it is our job as database designers to think about the types of queries that will be run against our database and how we want them to be categorized.
You can take a look at the categories we defined at the beginning of this session: maybe we would define a workload group, a category, for our ETL processes; maybe we'll define a category for our finance department, because we know those guys from finance need their queries handled first, they are super important and they get very upset if they have to wait a few seconds for the results; and maybe we'll define a workload group for ad hoc queries that are not that important and can wait. So classification is simply a way to assign our query to one of those workload groups. All right, but how is this classification designed, how does it happen under the hood? Basically, you create something called a classifier, inside which you define a set of rules based on which the classification happens, and usually what is used to classify a query is simply the identity of the principal submitting it. You might use, for example, the database user name that executed the query, or maybe a database role that this principal has, or maybe a Microsoft Entra login, and based on that login, so for example if it is me, then I want to be assigned almost all the resources, I want my queries to be executed as fast as possible. Maybe instead of individual logins you would like to use Entra groups, or maybe you would like to classify queries based on the time of day when they are executed: for example, you know that during the day, during business hours, it's basically users running queries, and at night it's the ETL processes that run, so based on that you would assign a different workload group. All right, so one more time: classification is simply mapping a specific request, based on some set of rules, to a resource allocation, to one of those workload groups. All right, but can you elaborate a bit on those workload groups, what they are? Sure, so let's assume that we would like to create a workload group for finance, but what does it mean to define a workload group? A workload group is simply a way to reserve a set of resources that will be available to a particular group of queries for execution. Let's say you would like to configure this workload group for the finance guys, the impatient ones who want their queries executed as fast as possible, and workload groups allow us to do exactly this. Now let's assume that this black box is the memory available for query execution. When we define a specific workload group, and right now we are doing this for the finance department, we can define how much memory will be reserved for this particular group, which simply means how much memory will be waiting for queries from the finance department. Let's say we know they need at least 10% of the memory, so what we would do is this: let's set the minimum amount of memory that will be waiting for those finance guys to 10%, and to do this you use the fancy MIN_PERCENTAGE_RESOURCE parameter when you are creating a workload group. It simply means that, no matter what, there will always be at least 10% of the total memory reserved for this finance workload group; no other process, no other query will be able to use it.
It will simply be reserved, so let's mark it like this: this 10% is reserved no matter what. But that is just the minimum value; how about the maximum amount of memory those finance guys can use? Maybe we don't want to grant them all the available memory, so maybe we would like to set the limit to 50% of the memory available in our SQL pool, and this parameter is called CAP_PERCENTAGE_RESOURCE. So what does it mean? Simply that for our finance department there will always be at least 10% of the memory waiting for them; they can use more if it is available, no problem, but no more than half of the total memory, so we leave at least half of the memory for other workload groups. That's how you can reserve a specific set of resources for a particular workload group (workload group, not resource group). Now, in the case of memory this cap, CAP_PERCENTAGE_RESOURCE, is a fixed limit; for CPU it is more fluid, because even if you set it to 50%, if there are no other queries running at the same time, go ahead, you can use all the available CPU power. For memory it's a hard limit: you cannot use more than 50%. All right, so the 10% is guaranteed, and they can use more if it is available, but at least 10% will always be there. But this limit is set for the whole workload group, and what if I have 20 queries from finance running at the same time? Can we set the amount of memory and CPU that will be available to every one of those queries? The answer is yes, we can. So let's say, and let me change the color to blue, that we would like to set the amount of memory available to every single query from our finance department to at least 5%, and you can do this with REQUEST_MIN_RESOURCE_GRANT_PERCENT. I love those names, and you can see I'm looking at the documentation, because there is no way I would remember them. Now you might think: if we allow the finance guys to use between 10 and 50 percent of the memory, and each of their queries gets at least 5% of the memory, then there might be 10 queries running at the same time, right? Because they are allowed to use half of the memory, and each query uses 5%, and 50 divided by 5 equals 10. Not really, it doesn't work that way, because here we have to take the performance level of the SQL pool into account. The performance level is this setting, DW100c and so on, that determines the number of compute nodes we have available, and apart from the number of compute nodes it also determines the number of concurrent queries that can be executed at the same time. If we look into the documentation, you will see that on the cheapest option, the one I'm using right now, DW100c, there can be only four queries running at the same time, so not that much to be honest, and in this case the parameter I was talking about, REQUEST_MIN_RESOURCE_GRANT_PERCENT, will effectively be set to 25%. Simply put, if you want higher concurrency, more queries running at the same time, you have to pay more and buy a more powerful version of the SQL pool; that's how it works. But anyway, those workload groups are a way to isolate our workloads, a way to reserve resources for them.
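To make this a bit more concrete, here is a minimal sketch of what such a definition could look like in T-SQL, assuming a hypothetical workload group name wgFinance and the 10% / 50% / 5% values from the example above:

```sql
-- Sketch of a workload group for the finance department (name and numbers are just the example values).
-- MIN_PERCENTAGE_RESOURCE            : memory reserved for this group no matter what (10%)
-- CAP_PERCENTAGE_RESOURCE            : memory cap for the whole group (50%; for CPU this acts as a soft limit)
-- REQUEST_MIN_RESOURCE_GRANT_PERCENT : minimum memory granted to every single request from this group (5%)
CREATE WORKLOAD GROUP wgFinance
WITH
(
    MIN_PERCENTAGE_RESOURCE = 10,
    CAP_PERCENTAGE_RESOURCE = 50,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5
);
```

If I read the documentation correctly, the per-request value has to divide the group minimum evenly, which 5 and 10 satisfy here.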
So the other important concept within workload management, apart from classification, is workload isolation. Let me write that down: workload isolation. And workload isolation is simply about reserving resources for a workload group, which is exactly what we just did. All right, so that's how it works, and finally the third important concept inside workload management is importance. We know that our queries are classified into those workload groups, and we know that each workload group has a specific amount of resources available, but what if we have, let's say, four queries from the finance guys and four queries from the ETL processes submitted at the same time? Which one will be executed first, what is the order of processing? That's where workload importance comes into play, so let me explain. Let me draw another rectangle, today there will be a lot of rectangles, and what is shown here is the scheduler of our dedicated SQL pool: simply the part of the pool that determines which query will be executed next. Let's assume that these are our queries, and we have the cheapest version, because our budget is limited, so only four queries can be executed at the same time. Let's say these four are running right now, but other queries were also submitted, like five, six, seven, eight and nine, and unfortunately they have to wait, they are queued (how do we spell that... like so). By default this scheduler works in a first-in, first-out manner, which means that if, let's say, query number two has finished executing, let me remove it from the running part, then the next available query will be executed; in our case it would be the first one in the queue, number five, so it jumps into the running part. Great. Then let's say number four finishes processing and leaves a free spot for the next query, which will be query number six, and so on and so on. All right, let me revert to the initial state, so that's how it was at the beginning, but let's change it a bit: let's say that query number seven is super important and we would like it to be executed before queries five and six, even though it was submitted later. Can we do this? Yes, we can, and workload importance, the third concept within workload management, is what helps us here. Simply, what we would have to do is assign high importance to its workload group; let's assume that query number seven comes from the workload group that got this high importance. So what happens then? Again, let's assume that query number two finishes processing, so there is a place for the next query to be executed, but this time the scheduler will not pick the next available query from the queue, number five; no, it will take the query with the highest priority, which in our case is query number seven, so query number seven is executed before queries five and six. This is importance, so let me write that down. Importance is usually defined at the workload group level, so in our case we might say: all right, the finance guys get the highest priority, because we know they cannot wait for the response, but those ad hoc queries are not that important, so let's give them the lowest priority out there. When it comes to the various levels of importance, we actually have five levels available, and they are as follows: in the middle we've got normal, the default one, then we've got high, we've got low, and we've got the ones in between, above normal and below normal.
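As a rough sketch of how classification and importance come together, assuming the hypothetical wgFinance group from before and a hypothetical database user, role or Entra group called finance_analysts, a classifier could look something like this:

```sql
-- Sketch of a workload classifier: requests submitted by the 'finance_analysts' member
-- (a database user, database role, or an Entra login/group - a made-up name here)
-- are routed to the wgFinance workload group and scheduled with HIGH importance.
CREATE WORKLOAD CLASSIFIER wcFinance
WITH
(
    WORKLOAD_GROUP = 'wgFinance',
    MEMBERNAME     = 'finance_analysts',
    IMPORTANCE     = HIGH
);
```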
Simply by using one of those importance levels you can influence which query gets executed next. So, to summarize this workload management I was talking about: it's all about determining how concurrent queries will be executed. First of all they are classified, using a classifier that usually works on the name of the user who is executing the query, and that assigns a query to one of the workload groups; there are some workload groups created out of the box, so even if you don't create any, some are there. Then within each workload group you define the amount of resources, CPU and memory, that will be reserved for particular queries, minimum and maximum, and finally you can define importance, which determines in which order the queries should be executed. That's what workload management is all about.

All right, let's move on to the next cool topic available within SQL pool, and that is result set caching. Really cool stuff, and this time I will show you a demo. Let's jump into our portal; I have my pool up and running, and that's my sample script that is executed against my DW test pool. Right now I have a query, that's the one, and it doesn't do anything special; let me run it, it simply calculates the number, or rather a sum, of parts per minifig type, and that's the output, a regular query, one of those we might expect to be executed on our pool. Now, if we run this query, you'll see that it takes some time to process; you can see this in the messages tab, this time it took over 3 seconds to return the data, and actually I don't have that many rows, only 14,000 or so. If I keep running the same query, this time it was a little over 1 second, and the next time again a bit over 1 second, still not a perfect response time, especially since nothing has changed in the data and the SQL pool has just executed this very query, so it should know the result. Why is it trying to execute it over and over again? The way to cache the results of the query is to use result set caching. You might notice in my query this OPTION (LABEL = 'result set caching test'); it doesn't do anything, it doesn't change the way the query is executed, it is just used to assign a label to the query, so later on I will be able to find it in various dynamic management views that show the internals of the SQL pool. So let me run it, this time with that option, and here I have another query, let me run it, that goes to the sys.dm_pdw_exec_requests view, which simply shows what requests have been executed, and the column that is important right now is this one, result_cache_hit, which is set to minus one in this case; minus one simply means that result set caching is not enabled on my SQL pool. All right, but what is this result set caching? It is simply a way to cache the results of your queries, so the next time the same query is executed, the pool will not try to run it from scratch, it will just grab the results from the cache, so it will be faster. By default it is disabled, so we have to enable it, and that can be done at the session level or for the whole pool. I will do it for the whole pool by using an ALTER DATABASE statement, but the important thing is that it has to be executed on the master database. So let me switch to master and run this ALTER statement; it will take half a minute or so, and I'll be back in a second.
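For reference, a minimal sketch of the statements used in this part of the demo; the database name DWTest, the table dbo.Minifigs and the label text are just placeholders standing in for whatever you have on your side:

```sql
-- 1) Enable result set caching for the whole pool - this must be run while connected to master
--    (the database name is a placeholder).
ALTER DATABASE [DWTest] SET RESULT_SET_CACHING ON;

-- 2) Back on the user database: run the query with a label so it is easy to find later.
--    The label does not change how the query runs, it only tags it.
SELECT MinifigType, SUM(Parts) AS TotalParts
FROM dbo.Minifigs                     -- hypothetical table from the demo
GROUP BY MinifigType
OPTION (LABEL = 'result set caching test');

-- 3) Check whether the execution was served from the cache:
--    result_cache_hit = 1 means a cache hit, 0 a miss, negative values mean caching was not used.
SELECT TOP 10 request_id, command, result_cache_hit, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'result set caching test'
ORDER BY submit_time DESC;
```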
Actually, it took only 10 seconds. Anyway, let's switch the database back to our DW test, rerun our query one more time and see what happens. All right, the data was returned, it took almost 2 seconds this time, and if we check our dynamic management view here, the rows are sorted descending so the newest one is on top, you will see that the result_cache_hit column is set to zero. It means the results of this execution were not taken from the cache, because we only just enabled the feature, and the first time the query runs is the moment when the results get cached. So if I rerun the query now, it should take the results from the cache and be faster. And it is faster, less than half a second; still not that great, but at least it's faster than it was originally, and if we check the exec requests view you will see that this time result_cache_hit is set to one, meaning that the results of the last execution were taken from the cache. If I keep rerunning this query it will be served from the cache from now on, and the execution time should be smaller than before. So simply, that's how it works: the data is cached so the query is not executed from scratch every time. All right, let me disable this, because I don't actually need it, and to do that I have to switch to master again; that's also one of the ways to empty the cache, or to disable it altogether. You might ask: all right, but when is the cache evicted, when is the data removed from the cache? Basically it happens automatically every 48 hours, every two days, if the cached result set was not used; or when the data is invalidated, for example if a table used as a source in the query has changed, if some new data was added to it, that's the first case; then the cache might be evicted if its size reaches the limit, which is 1 TB I believe; and finally you can do it manually by running one of the DBCC commands. By default result set caching is disabled, and it is really recommended to enable it to benefit from those faster execution times; but to use a cached result, the query has to be exactly the same as the one that was cached, otherwise it will not work. Also, if your query is non-deterministic, meaning that with every execution it returns different data, it will not be cached; an example of a non-deterministic query is one that uses GETDATE(), which returns the current time, so every time you run it, it returns slightly different values, and such queries are not cached. All right, so that was the second cool thing about SQL pool.
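If it helps, this is roughly what turning the feature off or clearing the cache could look like; again the database name is a placeholder, and the manual-eviction command is, to the best of my knowledge, DBCC DROPRESULTSETCACHE:

```sql
-- Turn result set caching off for the whole pool (run against master; database name is a placeholder)...
ALTER DATABASE [DWTest] SET RESULT_SET_CACHING OFF;

-- ...or only for the current session, on the user database.
SET RESULT_SET_CACHING OFF;

-- Manually remove cached result sets (the "one of the DBCC commands" mentioned above).
DBCC DROPRESULTSETCACHE;
```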
So let's move on to partitioning, and I know that partitioning has come up before and has different meanings depending on the service, so let's see how it looks in the case of SQL pool. Partitioning, all right, let's talk about this one. Actually, partitioning is not a concept created for SQL pool; it has existed for many years in SQL Server on premises, and basically partitioning is a way to improve the maintenance of your tables, and as a side effect it might also improve query performance in some cases. So what is partitioning? Basically, when you are creating a table in your dedicated SQL pool, you might partition it by some column, which results in splitting it logically into partitions, on which we can later execute operations that are beneficial to us. Partitioning is transparent to end users, they don't even know it is enabled, the same as with distribution; it is hidden in the internals of the SQL pool. All right, so when to use this partitioning stuff? Its main use case is maintenance. Imagine that you have a big table; in data warehousing it is usually a fact table, you partition fact tables rather than dimensions, and maybe your business rule is that you have to keep your facts for three years only, older data can be removed. So every month what you would do is remove the data from the oldest month, just to make room for new data. Obviously you could do this in the good old way, using a DELETE statement, DELETE FROM table WHERE some predicate, and it would work, but the thing is that it will take time, remember that this fact table is probably big, it will take locks on your table and it might degrade the overall performance of your database. By using partitioning it can be done much faster. Let me draw how this scenario might look; again, yet another rectangle, I told you there would be a lot of rectangles today. Let's say this is our big fact table, and later on we'll get back to what "big" means, when it is beneficial to use partitioning in the first place. Let's say we would like to partition it based on some column, and in 99% of cases you use a date column for partitioning, so let's assume we did exactly that: we partitioned our fact table based on a sale date, or whatever date there is, and we created partitions by month. So in our oldest partition we have data from January 2021, in the second partition we have data from February 2021, then March 2021, and so on, and in the newest partition we have some data as well. Now let's assume that we would like to get rid of the data from January 2021, because we no longer need it, and we know that a DELETE statement would take too much time, so we would like to use partitioning. This scenario is called a sliding window, because we keep a sliding window of data of a particular range, in our case three years, and it works like this. We would create yet another table, an empty one, whose schema, whose structure, is exactly the same as the structure of our big fact table, and as the first step we would switch partitions; that's the first type of operation we can do on partitioned tables. We would switch our whole empty table with the first partition of our fact table, and what it effectively means is this: the data from that partition is magically moved to the other table, just like that. It is a metadata operation, it is super fast, there is no blocking at all, so suddenly the data from January disappears from our fact table; we just removed it in a matter of seconds. That's the first thing you do, switch partitions. Then you would probably want to remove the data from this other table, which was empty previously and is not empty now; but because it is a separate table, this will not block any operation executed on the big table, it's completely fine, you could even truncate it, so it will be fast. Great, we removed the old data, but we also have to make room for the data of a new month, right? Yes, so as the third operation we would split the partition that exists as the newest one.
Splitting simply means adding a new range, a new boundary, for the new month, let's say July 2024, or whatever range is necessary. And finally, this old slot is empty and we don't need it anymore, so let's remove it: as the last step we would merge those two partitions together, which simply means we effectively remove the boundary so it becomes a single partition. We would repeat the process every month, so our window just slides over time, and that's why it is called a sliding window scenario. It is actually such a common case that you can find T-SQL scripts on the internet that do exactly this: switch the partition, remove the data, split the ranges and merge the partitions, so it just has to be configured and then used (there is a rough sketch of those statements a bit further below). Partitioning is beneficial here because it makes the removal operation very fast, there are basically no locks used at all, so you can still query the data without any issues. The second benefit of partitioning is that if we have queries that filter the data on the column that was used to partition it, for example that date column, then the query optimizer knows which partitions it has to read, so the query is simply faster. Now, when it comes to best practices around partitioning, the first important question is: which tables should you partition? Obviously the big ones, and the big ones are the facts, so you partition your fact tables, not the dimensions. Secondly, how many partitions should I have? A thousand? No, that's probably not a good idea; if you have a thousand partitions you have probably messed something up, don't create a partition for every day. The guideline from Microsoft is that you should have fewer than 100 partitions, so don't go crazy with creating them. All right, but we know that in the case of those big fact tables in SQL pool we index them with a clustered columnstore index, because that's the type of index that gives many benefits to analytical queries; so are there any guidelines on how to partition those tables? There are, so let me draw it again, yet another rectangle, a square more or less. Let's assume this is our fact table, because we want to partition fact tables, don't do this for dimensions, there is usually no point. Now, you should remember that in the case of dedicated SQL pools our data is always distributed into those 60 distributions; obviously you can influence how the data is distributed, but it will always be split into those 60 parts, so those green bars are our 60 distributions. Apart from those distributions we can partition our data, and it is done independently of the distributions, so let's say those purple things are our partitions. The guideline from Microsoft is this: the part I just highlighted, the combination of a partition and a distribution, has to have at least 1 million rows to benefit from the columnstore index, to compress it, to close the rowgroups, or however that is called. This should give you a hint about how big your table should be to even consider partitioning, because if your table is not that big, you may not benefit from partitioning at all. So the number of rows is one of the main criteria you should take into account when deciding on the number of partitions.
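Here is that rough sketch of the sliding window steps in T-SQL; the table names, partition numbers and boundary values are purely illustrative, and it assumes the staging table was created with the same schema, distribution and partition layout as the fact table:

```sql
-- 1) Switch the oldest partition (here assumed to be partition 1, January 2021) out of the fact table
--    into an empty staging table with an identical structure - a metadata-only operation.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;

-- 2) Get rid of the old rows without touching the big fact table.
TRUNCATE TABLE dbo.FactSales_Old;

-- 3) Add a new boundary for the upcoming month on the fact table
--    (for columnstore tables the partition being split should be empty, so do this before data lands in it).
ALTER TABLE dbo.FactSales SPLIT RANGE (20240701);

-- 4) Remove the now-unused boundary of the month that was switched out,
--    merging the two empty ranges into a single partition.
ALTER TABLE dbo.FactSales MERGE RANGE (20210101);
```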
So that's how it works conceptually, but how do we actually partition our data, what's the syntax? It works like this. When you create a table, you use the CREATE TABLE statement, that's obvious, right? So let's do this: CREATE TABLE, you give it a name, you provide its definition, the list of columns, data types and so on, then you provide some options, like for example the distribution, how this table should be spread across those 60 distributions, so WITH (DISTRIBUTION = ...) set to one of the three methods; and we know that in the case of fact tables we want to use hash distribution based on some column, whatever we believe is a good way to split the data, let's say customer ID. We might also set an index, this clustered columnstore index, and the important part is that here you also define the PARTITION part. To make this work you simply add PARTITION and then indicate which column should be used for partitioning, for example the sale date, and usually you want to partition on a date column, so sale date, then RANGE LEFT or RANGE RIGHT, and in a second I will explain what that means, for now let's assume RIGHT, then FOR VALUES and a list of those boundary points, for example 20210101, then 20210201 and 20210301; let's define only three boundaries for now. So what are those things? These are simply the boundary points, and here I'm assuming the format is year-year-year-year, month-month, day-day, so the first value simply means the 1st of January, the second one is the 1st of February, and so on. The boundary points you provide have to be values from the column you chose for partitioning, and usually in the case of data warehouses you want your foreign keys to have no business meaning, that's why you use surrogate identity keys like 1, 2, 3 and so on; but the one exception is dates, because for dates, and this is a foreign key to a date row, you usually construct the key in a specific way just to represent a specific date, and the benefit of this is that by looking at the key you already know which date it represents, so it is very easy to generate those values automatically if you need them in the future, for example to create new partitions for new months. All right, but how does it look on a diagram, how does it split our data? Let's assume this is our timeline, and let's mark those points on it: we've got three points, the first one is 20210101, the 1st of January, then 20210201, the 1st of February, and likewise 20210301, the 1st of March. If we define our partitions in this way, it simply means we want to split all the dates using those points, so the first partition would be all the data that falls within this range, from the beginning of the world up to the 1st of January, 20210101. All right, but how about the rows whose sale date is exactly equal to a boundary point, do they belong to the left partition or to the right partition? That's exactly what this keyword determines, it can be set either to RIGHT or to LEFT, and it simply means that in the case of RIGHT, and we used RIGHT, the boundary point belongs to the partition on the right.
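Putting it together, a sketch of what such a partitioned fact table could look like; the table and column names and the boundary values are just illustrative:

```sql
-- Hypothetical fact table, hash-distributed on the customer key, with a clustered columnstore index
-- and monthly partitions on an integer date key (yyyyMMdd).
CREATE TABLE dbo.FactSales
(
    SaleDateKey  int            NOT NULL,
    CustomerKey  int            NOT NULL,
    Amount       decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION
    (
        -- three boundary points define four partitions; with RANGE RIGHT each boundary value
        -- belongs to the partition on its right
        SaleDateKey RANGE RIGHT FOR VALUES (20210101, 20210201, 20210301)
    )
);
```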
So with RIGHT, the second partition would go from the 1st of January up to, but not including, the 1st of February; and if we had used the LEFT keyword, the rows equal to a boundary point would belong to the partition on the left, and so on and so on. So in our case, by using those three boundary points we defined four partitions; there is always one more partition than the number of boundary points, and simply, that's how partitioning works. In real life, in the case of smaller tables, you don't have to bother with partitioning at all; please remember the recommendation that each partition-per-distribution part should have at least a million rows, so think twice before you start partitioning, because it might not give you any benefit, and it might actually degrade your performance, because if your table is not big enough it will not benefit from the clustered columnstore index. All right, so that was partitioning, and the last things for today are a few T-SQL commands. The first one, before I jump to the portal, let's talk about pivoting and transforming data, so let me write it down: PIVOT and UNPIVOT. What is this all about? Let's assume that we have some data in our table, we have our minifigs, and we have a column called type, or minifig type, something like this, and we also have a column that says how many parts a given minifig has. Let's say the first row is type A with 10 parts, the second row is type B with 2 parts, the next row is type C with 1 part, and then there's yet another row of type A with 2 parts; that's our table, great. But what if we would like to transform it a bit and have the type values as columns, with a sum of parts in the cells? It would look like this: we would have A, B and C as column names, and as values we would like the sum of parts, obviously grouped by type, so in our case for A we would simply sum 10 and 2, so it would be 12, for B it would be 2, and for C it would be 1. The way to do this with a T-SQL query is simply to use the PIVOT operator, and it is available not only in SQL pool but in SQL Server on premises as well. I've used PIVOT maybe once in my whole career and I can't remember the syntax, but it is possible, so if you need to do this, PIVOT is there; the reverse operation is simply UNPIVOT. Just like in Excel, you might compare it with pivot tables in Excel, so that's how you can transform your data by swapping columns with rows; a rough sketch of such a query follows below.
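A minimal sketch of that transformation, assuming a hypothetical table dbo.Minifigs with MinifigType and Parts columns as in the whiteboard example (I had to look the syntax up, so treat it as an illustration rather than the exact query from the demo):

```sql
-- Turn the distinct MinifigType values (A, B, C) into columns and sum the Parts per type.
SELECT [A], [B], [C]
FROM
(
    SELECT MinifigType, Parts
    FROM dbo.Minifigs               -- hypothetical source table
) AS src
PIVOT
(
    SUM(Parts) FOR MinifigType IN ([A], [B], [C])
) AS pvt;
-- Expected output for the example rows (A,10), (B,2), (C,1), (A,2):  A = 12, B = 2, C = 1
```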
And finally, let me jump back to the Azure portal and talk about two additional things. First of all we've got identity. You know that in the case of data warehouses we often want a surrogate key, an artificial ID that doesn't have any business meaning, maybe except for the date keys that are used as foreign keys, and to automatically generate those values you can use the IDENTITY property on a column. So let me run this query: first I drop the table if I have it, no, I don't have it, and then I create a new table called minifigs with ID, in which I have a new column called ID that is defined as an integer with the IDENTITY(1,1) property, which simply means start from one and increment by one, and then some distribution, some indexing, without partitioning, because there's no point in doing it for this table, at least. The table has been created, and then let's just insert some rows into it. Please note that I'm taking the data from our existing minifigs table, but I'm not providing any value for the ID column, because that's why we used the IDENTITY property, to generate those values automatically. Let me run this, it took two seconds, and if I query the newly populated table and sort by our ID, you will see that indeed the ID column has been populated with values 1, 2 and so on, done automatically for us. So we can use the IDENTITY property for columns that we want to treat as surrogate keys, those artificial IDs; there is something similar to it in Databricks. That was identity, and the last thing for today is data skew. When I was talking about the hash distribution method, I told you that you should pay attention to how the data is distributed: for example, if you have sales transactions and 90% of the sales are made by a single customer, then if you hash your data on that column you might get a very non-uniform distribution of data, most of the data would land on a single distribution, which means the performance of your queries might suffer, because one distribution will have way more data to process than the others. The way to check whether you have data skew is to use the DBCC PDW_SHOWSPACEUSED command, and take a look at that PDW prefix: it comes from the old days when this product, the SQL pool, was called Parallel Data Warehouse and ran on premises, a long time ago; the product was renamed several times and finally landed as part of Synapse as the dedicated SQL pool. If I run this for our minifigs table, it simply shows me the number of rows on every distribution, so we have 60 rows in the result set, because that's how many distributions we have, and you can see how many rows sit on each of those distributions; if you compare them, you will see whether some have way more data than the others. All right, and that was it for today. Today I wanted to cover those extra features of SQL pool: workload management, which is all about reserving resources for your queries and deciding how, and in which order, they should be processed; then we covered result set caching, which is a way to improve the performance of your queries, because the results are taken from the cache instead of being executed every time; then we talked about partitioning, which is about improving maintenance, especially in that sliding window scenario; and finally we talked about some T-SQL commands like PIVOT, IDENTITY and the show space used command. So thanks for watching, and see you next time, take care!