Transcript for:
Understanding Serverless SQL Pool in Synapse

Hey there. So let's say you have your data warehouse created using a dedicated SQL pool. Great, you managed all the workload stuff, distributions, partitioning, access, and everything is working just fine. But then you start thinking: all right, but do I really need this additional layer after our data lake, which already stores a copy of our data? Maybe you would like to go for this newer approach, the lakehouse one, where you don't have this additional layer. And is it possible with Synapse? The answer is yes, we can use the serverless SQL pool to achieve this, and that's the topic for today. So let's get started.

All right, so today we're going to focus on the last component of Synapse that I would like to talk about, so let's start by summarizing what kind of capabilities in Synapse we've covered so far. We've got Synapse Analytics, and you know at this point that Synapse is just this umbrella that combines multiple subservices under the same service, under the same hood. What we know already exists in Synapse are, for example, pipelines, which are basically the same as Data Factory pipelines, and they can be used to orchestrate the whole processing or just to ingest the data; that's the place where they fit very well. Also, as a part of pipelines, we've got data flows, the same stuff as Data Factory data flows, and data flows can be used to transform our data in a visual way without writing any code. So that was the first component.

Then we know that data flows may be fine for data transformations, but if you would like to have more control over what is happening in your processing, data flows don't give you that control, so you want to write your own code, and the way to do this is to use Spark pools and then write your notebooks. This component of Synapse is something similar to Databricks, at least to Databricks clusters, because Databricks is way more than just notebooks. So that was used for transformation purposes.

What we also know exists in Synapse are SQL pools, and so far we learned about the first type of pool, which is dedicated. A dedicated SQL pool is an example of good old traditional data warehousing, where you have a separate database that physically stores our data, to which we can connect using T-SQL, and that has those regular tables, a good old database. But that's not everything that Synapse has to offer, because today I will talk about yet another flavor of SQL pools in Synapse, namely serverless, because we've got two types of SQL pools. So basically that's the whole landscape of Synapse that I want you to be aware of: pipelines, Spark pools, and two types of SQL pools, dedicated and serverless ones.

All right, but what is this serverless pool, where does it fit, and why should we use it? First of all, it is really a true serverless feature, so you don't have to create it. Please remember that in the case of dedicated SQL pools we first have to provision the whole Synapse workspace, and this doesn't change for serverless, but then inside our workspace we had to explicitly create a dedicated SQL pool and choose the performance level, this DW100c slider that determined many things like parallelism, the number of concurrent queries, the amount of memory available for processing, the number of worker nodes, and obviously the price. In the case of serverless we don't do this; it simply exists out of the box:
whenever we create a Synapse workspace, the serverless SQL pool is automatically created for us, so it just is there and it is ready to be used. Secondly, unlike the dedicated SQL pool, with serverless we pay for what we use. In the case of dedicated, we pay for the uptime, and the price might be quite significant; you should remember that the cheapest option, the one that I've been using, was about one and a half euros per hour or something like this, in the cheapest tier. In the case of serverless there is no price associated with uptime; you pay for what you use, basically for the amount of data that you process, and the price is quite interesting because it is really cheap: you pay 5 dollars per terabyte of processed data, so for example a query that scans 100 gigabytes costs roughly 50 cents. So it might be a really cheap option. But what does this processed data mean? It is the data that you read from your source, the data that you write to your target location, and some intermediate results that are necessary during the processing. So actually, depending on the way you use serverless, it might be super cheap, or it might be not that cheap if you are querying a lot of data very often, but basically I would position this service in the cheap area of services; at least it should be way cheaper than dedicated.

And obviously the architecture is completely different, because in serverless you can forget about those distributions, they simply don't exist. Serverless is created for us, you don't care about the infrastructure, and it uses a different architecture: it is no longer the massively parallel processing, MPP, one. In the case of serverless it is called Distributed Query Processing, but basically it means that serverless scales automatically depending on the workload, so you shouldn't even bother about the compute that is used under the hood; you just use it and you pay based on the amount of data you're processing.

All right, that's great, but when should we use it, what are the use cases for serverless? There are three main use cases, so let me write this down: serverless SQL pool use cases, and let me just move the whiteboard a bit. The first one I will start with is that you can use the serverless SQL pool for ad hoc data exploration, and in a second I will explain what I mean by this. Let's say that you're working with data and you received some CSV file or some Parquet file and you just want to see what's inside. In the case of a CSV file that's easier, because you can open a CSV even in Excel and it will format it more or less correctly; even Notepad is able to open a CSV file. But working with CSVs in Notepad is not the best user experience, and especially if your file is big, Notepad will just crash, it will not be able to open it. It is even more difficult with Parquet files, which require specific software to open them. You could use the open-source Parquet viewer application, but it is a third-party tool and it doesn't offer that many functionalities. And that's where the serverless SQL pool can be very useful, because it allows you to very easily explore your data. So let me show you how it looks. Let's jump into the Azure portal; that's my Synapse workspace, let me open Synapse Studio, and in a second I will show you how you can query your data. Let me go to the Data hub, to the Linked tab,
where we've got several data lakes that are linked to my workspace, and I will focus on this one, our main ADLS Gen2 data lake that we've been using so far. On this data lake I've got this curated container that contains data that is already transformed, and we've got this Rebrickable directory, and for example what we have is this minifigs directory; I believe that's the minifigs data that we transformed using Synapse Spark, and it was saved as Delta. Yeah, we can see that this is the Delta format, because we've got Parquet files and this _delta_log folder. All right, but let's say that I would like to see what's inside this Parquet file. I wouldn't be able to do this in Notepad, but here in Synapse I just right-click on the file and then select New SQL script > Select TOP 100 rows, and let's see what happens. It generated a SELECT statement that uses this super cool OPENROWSET function, and we'll get back to it in a second, and the thing to run it is to make sure to connect this time, oh sorry, to this serverless Built-in pool, not to our dedicated one; now we want to use serverless, so that's this one. As you can see, it has this green check mark, which means that it is available; basically it should always be available for you, just waiting to be used, and you don't pay for this waiting time, you just pay for the amount of data that you process.

So let me run this query and I will explain what it does. Take a look, I'm just using a regular SELECT TOP 100 SQL statement, which everyone knows how to use, right? And what is super cool is this OPENROWSET function that allows us to query data that lives externally to our serverless pool; in this case I'm querying the data from my data lake. The syntax is super easy: I just need to provide this BULK keyword, then provide a path to the file that I want to read, and finally indicate what the format of the file is, and in this case it is Parquet, and that's it. And take a look, the data was returned as a table, so now I can use the full power of SQL syntax to query this data, filter it, aggregate it, whatever I need to analyze the data. Take a look how easy it is to query a Parquet file: you just right-click on it, generate this dummy SELECT statement, and that's it, it works out of the box; I didn't have to bother with creating or waiting for an SQL pool just to make this work. Great.

But then you might think: all right, but you said that this curated minifigs directory is Delta, and it is, and we know that in the case of Delta we shouldn't focus on reading a specific file, because maybe it is no longer a valid file; we have this Delta log that maintains the changes made to our data, and maybe based on the log we should read only some of the files. So I don't want to read a specific file, I want to read the whole directory as Delta, not as Parquet. Is it possible? Yes, finally, in the serverless SQL pool we are able to read the Delta format, which is not possible in the dedicated SQL pool. So how to make it work? We simply need to change the format of the data that we want to read, and if we try to run it like that it will fail, because in the case of Delta we have to indicate not a specific file but a directory; that's how you work with Delta, you point to a directory that you want to read. So to make this work I'll just remove the part that indicates a specific file, and I will read the entire minifigs directory as Delta, and it should work.
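Before we check the result, here is roughly what those two generated queries look like; this is a minimal sketch, and the storage account and paths below are placeholders rather than the exact names from my workspace:

```sql
-- Query a single Parquet file (placeholder account and path)
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/curated/rebrickable/minifigs/part-00000.snappy.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];

-- Query the whole directory as Delta: point BULK at the folder, not at a file
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/curated/rebrickable/minifigs/',
        FORMAT = 'DELTA'
    ) AS [result];
```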
And hopefully it does, and you'll see that the data is there, it was retrieved. And because we can treat this result set as a table, I can write my own query on top of it; for example, let's see how many rows have been read from this minifigs directory, and we've got about 14,000, and that's the value that we've seen many times so far, so it is proof that indeed we were able to read all of the data.

So let me start making some notes about this first use case. We've got this OPENROWSET function, super useful, and it can read three types of files: it can read CSVs, we will take a look at that in a second, it can read Parquet, which we just saw, and it can read Delta. Finally, a service that is able to read Delta, and you should remember how many issues we had last time when I tried to copy data that was saved as Delta from a data lake into the dedicated SQL pool, because many services cannot understand Delta. And serverless can, but only partially, unfortunately, because the support for Delta is quite limited. Let me jump into the documentation that tells how we can query the Delta format using serverless SQL pools in Synapse, and unfortunately the first important statement here tells us that it can query only Delta Lake version 1.0, which is pretty old to be honest, and starting from Delta Lake 1.2 serverless is not able to understand those tables, so you will not be able to make use of the cool new features introduced in newer versions of Delta Lake. You should remember that Delta Lake is an open-source project that is developed and released independently from Databricks or Synapse, it is just a project on its own, and if we take a look at this Delta Lake version 1.0, or 1.0.1 actually, you see that it was released back in 2021, so quite an old version, and currently the newest version is 4.0, which is in preview right now. So quite an old version is the only one supported, and what Microsoft says explicitly is: hey guys, if you want to use newer versions of Delta, then switch to Microsoft Fabric, sorry. Which should also give you a hint about whether Microsoft plans to invest a lot into Synapse; here they say go to Fabric, just forget about Synapse. All right, but at least there is this functionality to read Delta, so at least something for us.

All right, so we know that it can read Parquet and Delta, so how about CSVs? I should have some CSV file here somewhere; yep, we've got this minifigs CSV, that's again our minifigs data that was transformed and saved as CSV. So let me generate a script that queries this file and run it on our serverless SQL pool, and yes, I was able to read the file, but it doesn't look nice. First of all, take a look at the column names: C1, C2, C3, and so on, and the actual column names that are included in the file were treated as just regular values. We want to fix this, and the way to make this work is to include the HEADER_ROW parameter set to TRUE, and this time the serverless SQL pool should treat the first row of the file as a header and use those values as column names. So let's see if that works, and yes, it works, now the data looks better, we've got those nice column names. All right, but how about data types? You know that CSV is a text file, so there is no data type information, unlike in Parquet or Delta, which store that metadata. So basically, in the case of CSVs, the serverless SQL pool reads the first 100 rows and, based on the values, it tries to guess the schema, and usually it gets it right, but if not, then you might want to provide the schema explicitly.
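For reference, here is roughly what the CSV query with the header option turned on looks like (the explicit schema comes next); again the path is a placeholder, and note that HEADER_ROW requires parser version 2.0:

```sql
-- Read a CSV and treat the first row as column names (placeholder path)
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/curated/rebrickable/minifigs_csv/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [result];
```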
For example, you might say: hey, I want to read only this one column, which is of the date type, and maybe the number of parts, which is of the integer type, or whatever. To make this work you can again use this cool OPENROWSET function and just expand it with a clause called WITH, where we can provide the schema of our data. So let me move this aside a bit, and let's say that I want to read only the last modified date column, and I know it uses the date data type, and maybe this number of parts column, which is an int, and that's it, I want to skip all remaining columns. Let's see if this works, and yes, take a look, we extracted only those two columns, and this time we know that their data types are set correctly, because we defined them explicitly here in the WITH clause. So that's the way to make sure that the schema of the data you read will be exactly as you want.

Now, this OPENROWSET function is really cool, so the question is: why didn't we use it in the dedicated SQL pool to read Delta files? Well, simply because it's not supported in dedicated; it is available only in the serverless SQL pool. So let's write this down, that unfortunately it exists only in serverless. Then we know that we can use this WITH part to define the schema of the data that we want to retrieve.

All right, so is there anything else that is important about this OPENROWSET function? Actually there is, so let me show you the documentation, this page about how to use the OPENROWSET function, and here we've got a diagram that shows the structure of a sample data lake. We've got this webdata directory, which contains a data file, and what is important is that it also contains a subdirectory with some data in it. The thing about serverless and this OPENROWSET function is that by default it doesn't read subfolders, so if you query this webdata directory, it will effectively read only the direct file that lives inside the folder; it will ignore those subfolders and their children, unless you use an explicit wildcard syntax to read everything. This might surprise you, because personally I would expect it to read everything, to just traverse the hierarchy, but it doesn't work this way. So let's write this down as well: the OPENROWSET function doesn't query subfolders by default.

All right, but for ad hoc data exploration that's fine, because usually we would just explicitly indicate which file we want to analyze, and analyzing a file with a SQL statement, as if it were a table, is super easy, especially for someone who knows SQL. So this is a really good use case for serverless, and please remember that this serverless SQL pool is available out of the box, we didn't have to create it, it was instantly available. All right, so that was the first use case for serverless.
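Before moving on, here is a minimal sketch pulling together those last two pieces, the explicit schema via WITH and the wildcard syntax for traversing subfolders; the column names, types, and paths are illustrative assumptions, not the exact ones from my files:

```sql
-- Explicit schema: read only two columns, with the data types we want
SELECT *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/curated/rebrickable/minifigs_csv/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        last_modified_date DATE,
        num_parts INT
    ) AS [result];

-- Wildcards: /** at the end tells OPENROWSET to also traverse subfolders
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/raw/webdata/**',
        FORMAT = 'PARQUET'
    ) AS [result];
```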
The second use case is to use it to create a logical data warehouse, which is basically a lakehouse. If you recall one of the first episodes, when I was talking about different architectures, what I mean here is the lakehouse one. That's the scenario where we don't want to have this additional relational data warehouse after our data lake, because the data is already available in the data lake, so what's the point in copying it over to yet another layer? If you would like to position this on our diagram, it would look like this. That's the diagram we've been talking about in the case of dedicated, and it shows the various options we have discussed so far: we start with the ingestion of various data sources, then we transform the data on the data lake, so in the last layer of the data lake we have our data transformed and stored as a star schema using the Delta format. What we did previously was to load this data into the dedicated SQL pool, physically copy it and store it in those tables that were later queried by our users or by Power BI, and that's fine, in some cases that is a good approach, and we could query it in many ways.

Now, what if we would like to modify this and not have this additional physical layer, and simply query the data from the data lake directly, but in a way that accounts for the fact that not every data consumer can understand Delta? We would need some kind of translator in the middle that, on one side, would be able to talk T-SQL from the consumer's point of view, but would also be able to connect to our data lake and read this Delta format. And that's where the serverless SQL pool can be used, so let me draw it here. We would use the serverless SQL pool, and how would it work? It would again connect to our data lake that stores the data in the Delta format, but we will not copy the data physically, and in a second I will explain what that means; we would just expose it to our data consumers.

All right, but I put those two SQL pools in the same place on our diagram, so what's the difference? The difference is huge. In the case of dedicated, we copied the data physically from the data lake to our SQL pool; in the case of serverless, we don't copy this data. Basically, when someone queries our serverless SQL pool using T-SQL, that query is transparently sent to the data lake, so the data consumer is not even aware that the data lives somewhere else; the data is read as Delta and returned on the fly to the consumer. So from the data consumer's point of view it speaks T-SQL, the same language that basically every tool out there understands, and the serverless SQL pool is used to translate those requests to the formats that are used to store the data, but the data is not physically copied to the SQL pool, it exists only in a single place. That's the biggest difference between those two architectures.

All right, so how to make this work, how to use this logical data warehouse together with serverless? Let's go back to our use cases, and you should already have a clue what we could use: we know that serverless has this SQL function called OPENROWSET that can be used to query data that lives on a data lake using a T-SQL query, so yes, OPENROWSET, that's the way to go to make this data available. Now, what if you would like to hide this complexity? Because in those statements you have to use this OPENROWSET function, you have to know the details about the location of the data and how it looks. Well, super easy: just create a view that has this OPENROWSET call as its body, and then expose the views to your data consumers, so they would just treat it as a regular view and the logic would be hidden inside the database view. So super simple, we could just use a view.
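As a rough illustration of that view-over-OPENROWSET pattern, here is a minimal sketch; the view name and path are just placeholders I made up for the example:

```sql
-- A view that hides the OPENROWSET details from data consumers
CREATE VIEW dbo.minifigs
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/curated/rebrickable/minifigs/',
        FORMAT = 'DELTA'
    ) AS [result];
GO

-- Consumers just query the view; they never see where the data actually lives
SELECT TOP 10 * FROM dbo.minifigs;
```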
And finally, OPENROWSET is great for this ad hoc data exploration, because we don't have to create any of those objects we had to create in the case of dedicated, like external data sources and external tables. But can we still do that? The answer is yes, you can still use external tables if you want a more structured approach to accessing your data; this is still supported. So it means that you can use two methods to connect to your data lake: either OPENROWSET or external tables. Which one to choose? There is a nice article in the documentation, which will be linked in the video description, that compares those two methods and gives a high-level overview of them, so you will be able to see which one works better in your case.

All right, so let's create some of those logical data warehouse objects. I will use a sample that is provided in the documentation; there is this setup script that creates multiple objects, so let me simply copy it and run it on my serverless SQL pool. But first I have to create a new database that will live on the serverless pool, so I will just use a CREATE DATABASE statement, something like ServerlessTest, and let me create this database. Then I will run the setup script in the context of this newly created database. So one more time, I will connect to my serverless SQL pool, to my compute power, but within this SQL pool I will use my newly created database, this ServerlessTest, and from this setup script I know that I want to remove the view that works with Cosmos DB, because I don't have Cosmos DB in my environment and I know it would fail. All right, so let me run this, it is creating some resources, and let's see what happened, what the outcomes of this script are.

Let me switch to the Workspace tab of the Data hub and see what databases we have available in our Synapse workspace. Right now it shows only the dedicated SQL pool that we've been using previously, so let me refresh the view, and this time we can see this newly created ServerlessTest database, and what was created by the script are some views and external tables. For example, we've got this CSV YellowTaxi view, whatever it is, so let's see what the view returns, like SELECT TOP 10 * from the CSV YellowTaxi view; that should work, and basically I just retrieved some data from some view, and me as a data consumer, I don't know where this data is coming from, I don't care, I'm just querying this database. The same if we query, for example, one of those external tables, like the Delta Covid one: SELECT TOP 10 * from it, and again, for me as a data consumer it is just an object inside the database that I'm querying.

So let's see what those objects look like. Let me go back to our setup script and briefly go through it. First of all, it removes some objects if they existed in the past, basically a cleanup part, so we can ignore this. Later on, it creates some credentials that are used to get access to data lakes, and those data lakes to which these objects connect are not mine; it's not my data lake, it is some Microsoft data lake to which we've got to authenticate, and the way to do this is to use this shared access signature option. Then it creates some database schemas, and then it creates those external tables, so we know that we have to go with external data sources, file formats, and so on. For example, in the first external data source you can see that it queries a storage account called sqlondemandstorage, and this is not mine, it is a Microsoft demo storage account; likewise this azureopendatastorage storage account, again it is not mine, it is a storage account that exposes the New York taxi data that we will be reading.
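Roughly the shape of those first steps of the script, the database, the credential, and the external data source, looks like this; it's a sketch with placeholder names and an abbreviated secret, not the script copied verbatim:

```sql
-- First, a database to hold the logical data warehouse objects
CREATE DATABASE ServerlessTest;
GO

-- Then, running in the context of that database:
-- a credential holding a shared access signature for the demo storage account
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE DATABASE SCOPED CREDENTIAL sqlondemand
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';
GO

-- and an external data source pointing at that storage, using the credential
CREATE EXTERNAL DATA SOURCE SqlOnDemandDemo
WITH (
    LOCATION = 'https://sqlondemandstorage.blob.core.windows.net',
    CREDENTIAL = sqlondemand
);
```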
And then the script creates some external file formats: in this case there is one for CSVs with different properties, and then it creates a file format for, take a look, Delta, and that's yet another difference between serverless and dedicated SQL pools. Serverless can read Delta not only via this OPENROWSET function, but also by using external tables, via PolyBase, which was not possible in the case of the dedicated SQL pool, but here it works. And finally, it creates some external tables using those objects created previously, so the CSV population table reaches some data stored on a Microsoft storage account. This data is not stored inside my serverless SQL pool, it lives somewhere else, and my pool is just used to reference this data and read it whenever it is needed. The same with the views, which, take a look, encapsulate the logic that uses this OPENROWSET function, so again those low-level details are hidden in the view definition, and for users we just expose this CSV YellowTaxi view; you can use it to read the data, but under the hood it uses the OPENROWSET function.

Now, one additional cool thing about the OPENROWSET function, and I know it's a really cool function, is that we can use wildcards, those stars. Basically the data on this sqlondemandstorage demo account is partitioned by year and month, and if you don't want to provide those values explicitly, you can use those asterisks, which mean just "year equals anything", just like in various command-line tools. But what is even better is that later on you can reference those values by using the filepath() function, like it is done here in this view: filepath(1) means "return the first wildcard value that was used by the OPENROWSET function", so in this case it would be the value after "year=", which we know is basically a year, and that's why the column is called year; the same for "give me the second wildcard", which would be a month. And actually, if we take a look at the data returned by this Parquet YellowTaxi view, so let me copy this and run it, we should see those year and month values somewhere here; these are the ones that were read based on the file path, on the directories. So this is really cool, and they also use this WITH syntax to explicitly provide the schema of the data. So that's how it works: basically you can see that by creating either external tables or views with OPENROWSET logic, we can expose data that lives somewhere else to SQL queries, which is super cool, and it means that data consumers are able to read data from a data lake via this serverless SQL pool.
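A condensed sketch of that wildcard plus filepath() pattern, roughly in the spirit of the docs' YellowTaxi view; the view name, path, and columns here are placeholders rather than the exact script contents, and for secured storage you would also reference an external data source and credential:

```sql
-- A view over partitioned Parquet data: wildcards in the path, filepath() to expose them
CREATE VIEW dbo.YellowTaxiDemo
AS
SELECT
    result.filepath(1) AS [year],   -- value captured by the first * (year=*)
    result.filepath(2) AS [month],  -- value captured by the second * (month=*)
    passenger_count,
    trip_distance
FROM OPENROWSET(
        BULK 'https://<storageaccount>.blob.core.windows.net/parquet/taxi/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];
```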
And actually, let's verify that I would be able to connect and read this data using SQL Server Management Studio. I have it installed here, so let me try to connect, let's just wait for it to start, and obviously I have to connect to some server. So what server name should I use? The answer is simple: let's go back to our Synapse workspace, to the overview blade, and here we've got two endpoints available. The first one, on the top, is the one that we used when we wanted to connect to dedicated SQL pools; the one below is basically the same but has this "-ondemand" suffix, so that's the one we use when we want to connect to serverless pools. Let me copy it and connect to this endpoint, and hopefully I will be able to connect; obviously I have to authenticate, I'm using Entra authentication, and I'm in. Take a look, I can see my ServerlessTest database and I can see those objects that are part of this database, like those views and tables.

So for example, let's query this CSV YellowTaxi view, let's run it, and there it is, I was able to read the data. The same if I would like to read the Delta data using Management Studio: it should also work, and yep, it works. So it is proof that indeed we could use the serverless SQL pool as this intermediate layer between a data lake and data consumers, without a need to physically copy the data.

By the way, did you notice that we've got something more, this demo lake database here? So what is this? Basically this is the database that we created when we were playing around with Spark SQL, so let me go back to Synapse: this lake database, that's the one that we also see in Management Studio. This lake database was a way to register the data that we were saving on a data lake, to register it in the Synapse catalog, so we could use SQL statements to refer to it. And if we see this database here, and the same here in this database picker, it means we can actually query those lake databases, so we can query the data that lives on a data lake, that we created using Synapse Spark, from the serverless SQL pool, which is again super nice. In this demo lake database we've got the dbo schema and the minifigs Delta table, for example, and I should be able to query it from the serverless SQL pool, and it works. Basically it means that Spark pools and serverless pools inside Synapse share the same metadata, and we can read those objects, those tables and databases created by Spark, from the serverless SQL pool, which is also nice, especially if you are using the same tool, Synapse Analytics, to perform all the steps of your project.

All right, so that was the second use case for serverless, this logical data warehouse, and the last use case that I would like to talk about is quick and simple transformations. Let's say that the data we have in a data lake, we would like to transform it a bit, maybe it is not in the form that you would like to expose for querying purposes, so how can we do this? We know at least some of the capabilities already: for example, we could create views inside our serverless SQL pool, and inside a view we can define logic that transforms our data somehow. That's the first way to do this, and views, well, maybe they wouldn't work for super complex transformations, but for some simple ones, like column selections, renaming, some data type conversions, they would work. All right, but what if we have more complex transformations to be done, can't we just save the results of those transformations as a table inside the serverless SQL pool? And the answer is no, and this is a big difference between serverless and dedicated: in the serverless SQL pool there are no normal tables, they don't exist. By a normal table I mean just a table that we create in the regular way, that stores data physically inside our database, and even if you expand the list of features inside this serverless SQL pool database, there is no place where those normal tables could be created; all we have are those external tables, which store data externally. So serverless cannot be used to persist your data, to store your data in a database; this feature is not available. That's a big difference compared to a regular dedicated SQL pool, where we were able to create a table and save the data there; in serverless it simply doesn't exist.
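Going back to those simple transformation views for a moment, here is a rough sketch of the kind of light reshaping I mean; the source path, column names, and casts are made up for illustration:

```sql
-- A view that does light transformation work: select, rename, cast
CREATE VIEW dbo.minifigs_clean
AS
SELECT
    fig_num                    AS minifig_id,        -- rename a column
    name                       AS minifig_name,
    TRY_CAST(num_parts AS INT) AS number_of_parts    -- convert a text column to int
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/curated/rebrickable/minifigs_csv/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [result];
```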
All right, so what if we would like to save some data? Let's say we have some view or some SELECT statement and we would like to persist its results somewhere. Is it possible? Yes, we can use something called CETAS, which means CREATE EXTERNAL TABLE AS SELECT, so we can save the data, but not inside our database, only in a data lake; basically we create an external table. This CETAS, CREATE EXTERNAL TABLE AS SELECT, simply means something as follows: we've got some query inside our serverless SQL pool and we would like to save its results somewhere, and this somewhere, in our case, is a data lake, that's the only place to persist the data, and it creates an external table, hence the name CREATE EXTERNAL TABLE AS SELECT. And please do not confuse this with the CTAS that we used in the case of the dedicated SQL pool, because CTAS is CREATE TABLE AS SELECT, not an external table but a regular one, and actually it was the opposite operation, because there we had data in a data lake, we had an external table defined, and what we wanted was to create a regular table and populate it with data read from our data lake, so CREATE TABLE AS SELECT, where the SELECT was a select from our external table, which means a select from the data lake. So the data flows in different directions in those two cases.

So basically that's how the serverless SQL pool differs from the dedicated SQL pool, and we've got those three main use cases. You should also remember that the setup of serverless is super easy, because it simply exists automatically; whenever you have created the Synapse workspace, the serverless SQL pool is up and running and ready, while the dedicated one you have to create explicitly. On the other hand, when it comes to performance, the serverless pool will usually be slower, because every time you query your data it has to go to the data lake, read the data, maybe perform some operations during query execution, and then return the results. In the case of dedicated we know that we've got those distributions we can optimize for our queries, we can distribute the data, we can index it in a way that speeds up our queries, so basically the dedicated SQL pool will probably respond to your queries faster than serverless.

All right, and basically that's it. To summarize, the serverless SQL pool is yet another flavor of SQL pools that we have available inside Synapse. It differs a lot from the dedicated SQL pool: first of all, it is serverless, so it is available instantly, you don't have to provision it, and you pay for what you use, for the amount of data that you process. It has those three main use cases: it can be used for ad hoc data exploration, when you have some data in a data lake and you just want to start querying it to see what's inside, and with serverless it is super easy; then you can use it to create this lakehouse architecture, where you don't need to store the data from your data lake as yet another copy in some database, because with serverless you can just read it using T-SQL and serverless will go to your data lake and get the data, although it will probably be slower than a dedicated SQL pool; and finally you can also use it to do some quick and simple transformations that you would embed as part of views or CETAS statements. All right, and that's it for today, thanks for watching, and see you next time, take care.
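As a final reference, here is a minimal sketch of the CETAS pattern mentioned in the summary; the data source, file format, table, and path names are placeholders, and an external data source plus file format like these would need to exist before the statement runs:

```sql
-- CETAS: persist the result of a query as files in the data lake,
-- exposed through an external table (assumes the MyDataLake data source
-- and ParquetFormat external file format were created beforehand)
CREATE EXTERNAL TABLE dbo.minifigs_summary
WITH (
    LOCATION = 'curated/rebrickable/minifigs_summary/',  -- folder in the data lake
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT minifig_name, number_of_parts
FROM dbo.minifigs_clean
WHERE number_of_parts > 10;
```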