hello everyone, welcome to the seventh video in the SnowPro Core certification series. In this video we are going to discuss file formats, sequences, streams and tasks within Snowflake. Without further ado, let us see each section in detail.

First, let us talk about file formats. File formats are named objects that can be used for loading data into and unloading data out of Snowflake. There are currently six supported formats: CSV, JSON, Avro, ORC, Parquet and XML. The formats supported for both loading and unloading of data are CSV, JSON and Parquet. Practically, all six formats are supported for data loading; the caveats apply to data unloading alone. That is why only CSV, JSON and Parquet are listed as supported for both loading and unloading, while Avro, ORC and XML are not supported when unloading data from Snowflake.

Right, moving on. There are multiple ways by which we can create file formats within Snowflake. One typical way is using Snowsight, which is Snowflake's web UI, and we can create a file format using SQL as well. There are lots of options available while creating file formats, like compression, skip header, skip blank lines, trimming spaces, fields optionally enclosed by, error on column count mismatch, etc. To understand more about this, the Snowflake documentation page provides the complete set of parameters that are part of the CREATE FILE FORMAT SQL command; all the options we discussed are mentioned there, so please glance through it. I will attach the link for this documentation in the description of the video as well.

Then, the schema defined in the file format must match the schema of the Snowflake table. This is very evident when creating an external table: if the schema of the Snowflake table doesn't match the schema of the file format, the CREATE EXTERNAL TABLE statement will fail. Partitioned loading, meaning we want to load some subset of data from the bigger chunk of data, is supported only for three file formats, which are ORC, Parquet and JSON.

Here are the file format create statements and the stage statements; please have a look at them. First is the file format creation: we are creating my_csv_format with the type CSV, the pipe character as the field delimiter, and we skip the first line of the file as the header. Using this file format we create the stage my_csv_stage. We can very well utilize a storage integration as well while creating the stage, but the file format remains the same. Then, using that stage, we load the data into the table with the COPY INTO command. So these file formats are very important while staging and loading the data. The JSON example is more or less the same as what we saw for CSV: we define the file format as type JSON, use it to create my_json_stage, and then load my_json_table from my_json_stage.
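To make this flow concrete, here is a minimal sketch of the statements described above, assuming the object names used in the video (my_csv_format, my_csv_stage, my_table and the JSON equivalents) and a simple internal stage; your storage setup and table definitions may differ.

-- Pipe-delimited CSV file format that skips the first line as a header
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

-- Stage that uses the file format (a STORAGE_INTEGRATION could be added for an external stage)
CREATE OR REPLACE STAGE my_csv_stage
  FILE_FORMAT = my_csv_format;

-- Load the staged files into the target table
COPY INTO my_table
  FROM @my_csv_stage;

-- The JSON flow follows the same pattern
CREATE OR REPLACE FILE FORMAT my_json_format TYPE = 'JSON';
CREATE OR REPLACE STAGE my_json_stage FILE_FORMAT = my_json_format;
COPY INTO my_json_table FROM @my_json_stage;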
Now we are going to discuss sequences within Snowflake. Sequences are an interesting feature that is available in most database systems to generate a uniquely identifiable, continuously running value. Sequences are used to generate unique numbers across sessions and statements, including concurrent statements, and they are commonly used to generate values for a primary key. If you remember, in the data warehousing world the primary key is the column that uniquely identifies a row in a table; typically the employee ID can be the primary key for the employee table, meaning the employee ID will be unique for each and every record. In those use cases we can very well utilize sequences. Snowflake does not guarantee generating sequence numbers without gaps; the generated numbers are not necessarily contiguous. We are going to see a detailed example of this.

Now coming to the SQL statements. We create or replace a sequence, seq01, which starts with 1 and increments by 1. We are creating a sequence test table as well, just for loading purposes, to see how it goes. Now we execute SELECT seq01.NEXTVAL and it gives the first value, which is 1. If we execute the same statement again it gives the next value, 2; it is getting incremented by 1. Next I insert whatever seq01.NEXTVAL returns into the test table. Now try to guess what value is going to be available in the sequence test table. Yes, it is of course 3. Why? Because at that time the value was 2, and when we used NEXTVAL it became 3, so the value 3 got inserted into the sequence test table.

Here we are trying to do something different: instead of incrementing by 1, we increment by 5. You can fire a statement with seq05.NEXTVAL referenced four times, and each reference gets incremented in multiples of 5. It starts with 1, then 5 gets added to give 6, another 5 gives 11, and another 5 gives 16. Now if I fire the same statement again, our expectation is mostly 21, because 16 plus 5 is 21. Instead of 21, it started with 36. That is the reason why there is no guarantee of generating sequence numbers without gaps; the generated numbers are not necessarily contiguous.
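As a quick recap of the sequence behaviour just described, here is a minimal sketch; seq01, seq05 and sequence_test are illustrative names, and the exact values after a gap will vary.

-- Sequence starting at 1, incrementing by 1
CREATE OR REPLACE SEQUENCE seq01 START = 1 INCREMENT = 1;

SELECT seq01.NEXTVAL;   -- first call returns 1
SELECT seq01.NEXTVAL;   -- next call returns 2

-- Use the sequence while inserting into a table
CREATE OR REPLACE TABLE sequence_test (id NUMBER);
INSERT INTO sequence_test SELECT seq01.NEXTVAL;   -- inserts 3 at this point

-- Sequence incrementing by 5, with four NEXTVAL references in one statement
CREATE OR REPLACE SEQUENCE seq05 START = 1 INCREMENT = 5;
SELECT seq05.NEXTVAL a, seq05.NEXTVAL b, seq05.NEXTVAL c, seq05.NEXTVAL d;   -- 1, 6, 11, 16
-- Running the same SELECT again is not guaranteed to continue at 21;
-- gaps are possible (the video observed it jumping to 36)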
Now we are going to discuss tasks within Snowflake. In simple terms, tasks are the mechanism within Snowflake that helps us schedule something. If you want to schedule a job, if you want to run a SQL statement at a specific interval of time, all sorts of scheduling can be achieved using tasks within Snowflake. A task can execute a single SQL statement, call a stored procedure, or run procedural logic written with Snowflake Scripting, so pretty much any code can be run using a task. Tasks can be combined with streams for continuous ELT workflows to process recently changed table rows; this is all about change data capture, which is part of Snowflake streams, and we are going to discuss it in this same video.

There are two different kinds of tasks currently available. One is the serverless task, where Snowflake manages the compute part of it. For a task to run we need compute attached to it, and with a serverless task that compute is completely managed by Snowflake. The other form is the user-managed task, where we need to specify the compute by providing an existing virtual warehouse as part of the task definition.

Now we can see this in more detail using the SQL statements. Here is the example of a serverless task: task T1, which is supposed to run every 60 minutes, inserts the current timestamp into mytable, so every 60 minutes whatever the current timestamp is gets inserted into this table. The parameter USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL' is very, very important: using this statement alone we can identify whether a task is serverless or user-managed. The warehouse starts at XSMALL as the initial size, and Snowflake will automatically adjust the size of the warehouse based on the workload of the task.

Here is the example of the user-managed task. We specify the warehouse, MYWH, very clearly at the time of the task definition, so mytask is going to run as a user-managed task with the warehouse MYWH, and the schedule uses cron. We can very well utilize a cron expression to do the scheduling. Cron is a well-known scheduling mechanism that has been prevalent for a long time; it is very evident on Unix and Linux based systems, where we used to schedule a lot of jobs using crontab. Similarly, we can schedule Snowflake jobs using a cron expression. Here it is just doing SELECT CURRENT_TIMESTAMP. The bottom statement is more or less the same as the one we did for the serverless task: it runs every 5 minutes using MYWH, takes the current timestamp at that time, and inserts it into mytable.

For tasks there is a workflow we need to follow, workflow meaning how we can create a task. There are three steps which need to be followed. First is the creation of a task administrator role: we create a TASKADMIN role, which grants a specific user or a specific role the ability to create tasks. Then we create the task using the CREATE TASK statement. By default, the task which is created is suspended, so this initial suspension is an important point to remember. Here we are creating mytask_minute, which uses MYWH, so it is a user-managed task, and it runs every five minutes; as usual, it inserts the current timestamp at that time into mytable. Then we need to fire the resume command, because the task is created in suspended mode: we use ALTER TASK followed by the task name and then the RESUME keyword to resume that specific task.
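Here is a minimal sketch of both task flavours and the resume step just described; the names t1, mytask, mytask_minute, mytable and MYWH mirror the ones used in the video and are placeholders, and the cron expression shown is just an example schedule.

-- Serverless task: the initial-warehouse-size parameter is what marks it as serverless
CREATE OR REPLACE TASK t1
  SCHEDULE = '60 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  INSERT INTO mytable VALUES (CURRENT_TIMESTAMP);

-- User-managed task: an existing warehouse is named explicitly, scheduled with a cron expression
CREATE OR REPLACE TASK mytask
  WAREHOUSE = mywh
  SCHEDULE = 'USING CRON */5 * * * * UTC'
AS
  SELECT CURRENT_TIMESTAMP;

-- User-managed task on a simple interval schedule
CREATE OR REPLACE TASK mytask_minute
  WAREHOUSE = mywh
  SCHEDULE = '5 MINUTE'
AS
  INSERT INTO mytable VALUES (CURRENT_TIMESTAMP);

-- Tasks are created suspended, so resume them before they start running
ALTER TASK mytask_minute RESUME;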
Moving on, we are going to discuss an important and interesting feature of tasks called task graphs. We can combine tasks together, meaning I can create one task which is dependent on another task, which is dependent on another task. By doing so we are creating a chain relationship between the tasks, and that can be done using a task graph, which is also called a DAG, a directed acyclic graph. If you are from a data engineering background you are very well aware of the DAGs that are part of Airflow, where we follow the same set of procedures.

A task graph is a series of tasks composed of a single root task and additional tasks organized by their dependencies. The flow goes in a single direction, meaning a task later in the series cannot prompt the run of an earlier task; that is a very important thing to know, the flow goes in one direction only. Each task except the root task can have multiple predecessor tasks, and each task can have multiple subsequent tasks, meaning child tasks that are dependent on it. A task runs only after all its predecessor tasks run successfully to completion. This is very important: a child task cannot run unless the parent tasks have completed. The root task should have a defined schedule that initiates the run of the task graph, so for any root task we need to define the schedule. You can specify the predecessor tasks while creating a new task, as usual with the CREATE TASK statement, by specifying the AFTER clause; or, if you want to add the dependency at a later point in time, you can very well use the ALTER TASK command and add the AFTER clause to it. A task graph is limited to a maximum of 1,000 tasks in total, including the root task, and a single task can have a maximum of 100 predecessor tasks and 100 child tasks. These limits are very important from the exam point of view.

Now you can see this diagram, which is taken from the Snowflake documentation. A is the root task; B and C are child tasks which are dependent on A, meaning only when task A has completed will tasks B and C be initiated. D is dependent on both B and C, meaning only when both the B and C tasks have completed will task D get initiated. For a more meaningful understanding, you can see the second example here: this is the root task, and these are the child tasks updating a customer table, updating a product table and updating a date and time table; only once all three of those tasks have completed will the aggregate sales table task get initiated. Now we can see the command which is used: CREATE TASK task5 AFTER task2, task3, task4, and then we insert the current timestamp into the table. This is very well represented in the diagram: task5 is completely dependent on task2, task3 and task4.
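A minimal sketch of the dependency just described; task1 through task5, mytable and MYWH are illustrative names, and the predecessor tasks are assumed to already exist in the same graph (with the root task suspended while the graph is being changed).

-- task5 runs only after task2, task3 and task4 have all completed successfully
CREATE OR REPLACE TASK task5
  WAREHOUSE = mywh
  AFTER task2, task3, task4
AS
  INSERT INTO mytable VALUES (CURRENT_TIMESTAMP);

-- A predecessor can also be added to an existing task at a later point in time
ALTER TASK task5 ADD AFTER task1;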
Now we are going to discuss something called streams within Snowflake. Streams are all about change data capture. What is change data capture? I have a table, and on that table I am going to do some operations: some inserts, some deletes, some updates. Based on those operations, I want to do some other operations on some other table. Those sorts of things are called change data capture, and that is very well built into Snowflake in the form of streams. So what are streams all about? Streams track the changes made to tables, views, external tables and directory tables within the Snowflake data platform; they capture the inserts, updates and deletes made through DML statements. The ideal example is change data capture in a data pipeline.

This diagram, which again I took from the Snowflake documentation, explains very well how streams work. If you look at the source table, there is version 0, version 1, version 2, version 3, and after version 3 we started our stream, stream1. After that there are versions 4, 5, 6 and so on up to version 10, meaning various changes happened to the table over time, and the CDC stream now contains all the changes from version 4 up to version 10.

Streams come with three important metadata fields: METADATA$ACTION, METADATA$ISUPDATE and METADATA$ROW_ID. We are going to discuss these in more detail, but first let us understand what they are. METADATA$ACTION indicates the DML operation: whether an INSERT or a DELETE type of operation happened on the table is recorded here. METADATA$ISUPDATE is a true or false field which is set to true for any update statement. In streams an update is handled as a two-step representation: updated rows in the source object are represented as a pair of DELETE and INSERT records, and both of those records will have the METADATA$ISUPDATE value set to true. The last one is METADATA$ROW_ID, which is an immutable, uniquely identifiable ID used to track the changes to a row over the period of time.

Coming to the other part, which is the types of streams, there are three types. Standard streams track all the DML changes to the source object, including inserts, updates and deletes. Append-only streams exclusively track row inserts alone; update, delete and truncate operations are not covered. Insert-only streams are also there, which track inserts only and do not record delete operations that remove rows from the inserted set.

To understand more, there is a fantastic diagram available in the Snowflake documentation; I happily borrowed this diagram from the documentation as well. There is a table, and this is version 1 of the table, which contains only one record: apple as column one and 5 as column two. A stream is enabled on the table. The insert that happened for this apple record is shown in the stream with METADATA$ACTION as INSERT, METADATA$ISUPDATE as false, and ROW_ID as a unique value. Now what we are going to do is insert two new records, orange and banana, and update the apple record with a new value of 9, so the 5 is updated to 9. You can see how this shows up in the stream: orange is an INSERT, banana is an INSERT, and for the update of the apple record, as we discussed earlier, there are two records with the METADATA$ISUPDATE field set to true, a DELETE and an INSERT: apple 5 got deleted and apple 9 got inserted. Finally we issue a DELETE command. In the diagram the black rows are inserts, red is the delete and blue is the update. We fire a delete for the banana record, so the stream holds METADATA$ACTION as DELETE, and the METADATA$ISUPDATE flag is false. So these two fields, METADATA$ACTION and METADATA$ISUPDATE, are very important, and I think you can now clearly understand how they play out for any of the inserts, deletes and updates.

Now the important feature we talked about as part of tasks, that tasks can be combined with streams: this is the place where we combine the task along with the stream. You can see it here: we create mytask1, a user-managed task using MYWH as the warehouse, which runs every 5 minutes, and it has the condition SYSTEM$STREAM_HAS_DATA('MYSTREAM'). So only when data has been captured in the stream MYSTREAM will this task be initiated, and it inserts the data into the target table based on whatever is there; notice it is inserting only the INSERT records from MYSTREAM into the table mytable1. Now we have successfully combined both tasks and streams together.
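Here is a minimal sketch of that stream-plus-task pattern, assuming the names used in the video (mystream, mytask1, mytable, mytable1, MYWH); the columns id and name are hypothetical and stand in for whatever the source table actually contains.

-- Stream that records DML changes on the source table
CREATE OR REPLACE STREAM mystream ON TABLE mytable;

-- Task that only runs when the stream has captured changes
CREATE OR REPLACE TASK mytask1
  WAREHOUSE = mywh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('MYSTREAM')
AS
  INSERT INTO mytable1 (id, name)
  SELECT id, name
  FROM mystream
  WHERE METADATA$ACTION = 'INSERT';   -- consume only the inserted rows

ALTER TASK mytask1 RESUME;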
Before concluding the video, I just wanted to go through the set of documentation pages from Snowflake that are very helpful from the exam point of view. We already saw the one for file formats. With respect to sequences, there is documentation that will help you understand how to create a sequence and what parameters you can add as part of the creation of the sequence. Then there is the tasks documentation, covering everything we saw: serverless tasks, user-managed tasks, how you can choose the warehouse sizes, recommendations for choosing the compute model (when to choose a serverless task and when to use a user-managed task), task scheduling and how we can schedule the tasks, and the task graphs; all of these things are clearly explained there, so please glance through it. Similarly for the streams portion, you can see table versioning, repeatable read isolation, the METADATA$ACTION, METADATA$ISUPDATE and METADATA$ROW_ID columns, the types of streams, how the data flow looks, the data retention period and staleness; everything is mentioned as part of that documentation.

With this we have come to the end of this video. I hope this video has been informative for you. Please do write lots of comments; that will help me enhance the content of these videos. Thank you very much for watching.