Hi everybody, my name is Caleb Bechtold and I'm a machine learning and AI architect in the field CTO office here at Snowflake. Today I'm excited to walk through a lab with you on end-to-end machine learning using Snowpark Python, so let's get started. Before we dive into the code and all the juicy technical stuff, I want to spend some time talking through what we're going to build today and the various components we'll use along the way, including what Snowpark really is and how it works under the hood.

At a high level, what we're going to build is a scalable and secure machine learning workflow that executes entirely on the Snowflake Data Cloud: no data movement, no egress of data to some external platform, it all runs inside of Snowflake. We'll orchestrate this workflow using a DAG in Apache Airflow that primarily leverages Snowpark Python under the hood, and incorporate the various MLOps components around data lineage and tracking, reproducibility, automated execution, and all that good stuff. The last thing we'll show is a compelling, explainable visual app built with Streamlit and Snowpark that exposes the underlying models and insights we build along the way to an end consumer or business user for our particular use case.

Something that's very exciting about Snowpark Python in particular is the ability to use it with orchestration and other external tooling beyond Snowflake to manage your data pipelines and machine learning workflows without having to move data back and forth between Snowflake and some external platform. Airflow is certainly not the only option for doing this today, but it is a very popular one, so we'll use it over the course of the demo to orchestrate the DAGs and pipelines we build along the way. As part of that, we'll show how a whole collaborative development cycle for an ML workflow comes together using Snowpark, and how it can ultimately be operationalized in an Airflow DAG. Today we'll actually use the tools from one of our partners, Astronomer, who provide, among many other things, managed Airflow instances to customers, and show how the Astro Cloud and Astro CLI tools they've developed integrate with Snowflake and let you very easily build Apache Airflow DAGs that ride on top of Snowpark code and ultimately execute in the Snowflake server-side runtime environment. That extends all the benefits Snowflake provides to existing SQL workloads to your Python data engineering, machine learning, and ML Ops work as well, while leveraging best-of-breed tooling like Airflow, and in particular the Astro tools released on top of open source Apache Airflow.

One thing we're really excited about at Snowflake is the growing prevalence of and excitement around Streamlit, and in particular what Streamlit and Snowflake together can do. At the end of this demo we're going to visualize and consume the resulting machine-learning-derived insights via a Streamlit application, again without having to extract and egress data to some external system just to serve it up to an application user.
All the underlying forecasts and insights we'll produce, and the data associated with them, will live inside Snowflake but be consumable through a clean, friendly, easy-to-use web application, without extracting data into some external database to serve that app.

For a little context around the demo: we're not going to cover the entirety of Snowpark in this session, simply because there's a lot you can do with Snowpark and we can't get to all of it. Some things we will highlight along the way: we'll use the Snowpark client DataFrame API to perform our data ingest, data prep, and feature engineering tasks; we'll use the Snowpark Python server-side runtime, and Snowpark UDFs in particular, for our model training, inference, and evaluation tasks, which ultimately get served back up to an end user through our Streamlit application. We're also going to use custom libraries that aren't natively supported by our Anaconda integration; we'll talk a little about the Anaconda integration and what it affords us, but this is something we see with customers as well, who have homegrown packages they want to build on top of and use inside Snowpark, so we'll show what using those custom libraries looks like. And we'll show what end-to-end ML apps and orchestration of Snowpark capabilities in Snowflake look like, using native Snowflake internals as well as tasks with Apache Airflow, and then serving up our model results, model explainability, and monitoring metrics through a Streamlit web application.

The real purpose of this is to show the art of the possible with Snowpark today. There are many, many ML tools in the ecosystem providing varying capabilities across the whole spectrum of machine learning and data science maturity; Snowpark is one of those options for Snowflake customers, and we're excited to show what a fully built solution with Snowpark looks like today. The other piece is that this lab is designed to represent what real-world use cases look like for this kind of development and deployment cycle. In particular, we're talking about data on the scale that Snowflake customers work with, data that has to be governed and secured, where we use role-based access controls to manage various users' interactions with that data. The development of this capability is meant to mirror a team-based development cycle: we'll have multiple users who use various components of the code we're developing and are responsible for different tasks throughout the lifecycle, and you'll see how that all comes together using Snowpark Python. It's also modular: each task in the workflow is split into its own module, so you can start where you want, backtrack, and revisit topics. If you're very interested in the feature engineering, for example, you can focus your time there; if you're interested in deployment and operationalization, you can focus there. It's designed to be interactive and give you the ability to revisit things and backtrack. We'll cover as much as we can in the lab today, but I imagine you'll want to spend additional time afterward looking at some of these components in more detail.
One question you might be asking yourself before we get started is: what is Snowpark, actually? There's been a lot of excitement and buzz around Snowpark Python since we announced public preview earlier this summer at Snowflake Summit, but there may be misconceptions out there about what Snowpark actually is, and also what it is not. Fundamentally, Snowpark encapsulates our development framework for extending Snowflake support beyond just SQL workloads. Snowflake was born out of the enterprise data warehousing world, and SQL is where we started. As customers consolidate on Snowflake as the Data Cloud and their data platform for all their various data needs across applications, business intelligence, machine learning, and data science, SQL alone is not enough to support all those users and use cases, and so Snowpark is our endeavor to extend the power of Snowflake to non-SQL users and non-SQL workloads.

Snowpark has two fundamental components. The first is a client-side API; in the case of Python, which is what we're talking about today, you can think of this as just a library you pip install into any Python environment. Anywhere you can run a Python kernel, you can connect to Snowflake and use the Snowpark Python API, and it lets you do DataFrame-style transformations on data inside Snowflake that execute on Snowflake virtual warehouses, without doing any data extraction to your client or using any local compute resources to perform that work. We'll dig into how that works in just a second. The other piece is the server-side runtime that Snowpark provides. This lets you execute arbitrary Python code via Snowpark Python stored procedures, user-defined functions (UDFs), and user-defined table functions (UDTFs), and push that code to run on Snowflake virtual warehouse compute resources, again without extracting any data or restricting yourself to workloads that only SQL can support. For machine learning and data science, one easy example to help understand this is deploying a trained machine learning model into Snowflake using Snowpark Python to perform batch, offline inference. The Python runtime inside Snowpark lets you do this at scale on data in Snowflake via a Python UDF, performing batch inference using our compute infrastructure without extracting data to another environment or standing up REST APIs to host your models behind endpoints. That's an easy example of how this works, and we'll show various elements of the server-side runtime throughout the lab.
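To make that batch-inference pattern concrete, here is a minimal, hypothetical sketch (not the lab's code): it assumes an existing Snowpark `session` and a scikit-learn model `model` that was already fit client-side, and the table, column, and function names are illustrative.

```python
import snowflake.snowpark.functions as F
from snowflake.snowpark.types import FloatType

@F.udf(name="predict_trip_demand",                 # illustrative name
       input_types=[FloatType(), FloatType()],
       return_type=FloatType(),
       packages=["scikit-learn", "pandas"],        # resolved from the Anaconda channel
       replace=True,
       session=session)
def predict_trip_demand(temp: float, precip: float) -> float:
    # `model` is pickled along with the function and shipped to the warehouse nodes
    return float(model.predict([[temp, precip]])[0])

# Batch inference is pushed down to Snowflake compute; no data leaves the platform.
scored = session.table("FEATURES").with_column(
    "PREDICTED_TRIPS", predict_trip_demand(F.col("TEMP"), F.col("PRECIP")))
scored.write.mode("overwrite").save_as_table("SCORED_FEATURES")
```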
A fair question would be: that sounds great, but why should I actually consider using Snowpark? A lot of it has to do with streamlining architecture. You can run Python, Java, or JavaScript in all kinds of different compute environments, but when you run Python in, say, some AWS environment, you fundamentally become responsible for maintaining a separate set of compute infrastructure and resources, and there's overhead associated with that. Snowpark lets you simplify that architecture to be totally Snowflake-centric: you don't have to manage additional compute resources, and you don't have to deal with moving data between platforms. All the data lives in a single platform, and we natively support the various users, the programming languages of their choosing, and the requirements of the workloads they need to execute effectively. You also benefit from Snowflake's core capabilities and features, including our superior price performance and near-zero maintenance; the same things people love Snowflake for from a SQL perspective today extend to Snowpark, and you'll experience that throughout the lab as well. The other piece is governance, and I know data scientists and machine learning engineers hate hearing about governance because it's usually only in the context of some platform owner getting mad at you for not following policies or for requesting access to data you're not supposed to see (trust me, I've been there). But Snowpark lets you satisfy your security and governance needs at an enterprise level, which is extremely important, without inhibiting or restricting your ability to do the work you need to do. It allows you to enforce consistent enterprise governance controls and security across all your workflows, including machine learning and data science, now that we have Python support in Snowpark.

So what's unique about Snowpark Python specifically? The first thing is extending familiar programming constructs to Python users. The DataFrame abstraction has become very familiar and popular with Python users and data scientists, and that's what Snowpark for Python provides as well, extending that same DataFrame abstraction to data that lives and resides in Snowflake and to transformations executed on Snowflake compute resources. The other piece I want to emphasize is the ecosystem of open source tools available to you in Snowpark Python as a result of our integration with Anaconda. Anaconda provides a managed repository of hundreds of packages and automated dependency management to Snowpark users and customers. Those packages are available in your local development environment, where you'll use the Anaconda channel managed for Snowflake as part of this lab, and the integration also brings those packages and their dependencies into the server-side runtime, simplifying deployment of Python capabilities into the Snowpark Python runtime on Snowflake compute resources.

For a little more specific info on the DataFrame API: fundamentally, it lets us query data inside Snowflake using Python without writing SQL. You can see at the top here that dot-chained, DataFrame-style operation on data, doing filtering, aggregations, column references, and things like that. That familiar API ultimately equates to 100% pushdown of compute into Snowflake, lazily executed, much the way you might be familiar with from the PySpark API. The benefit is that you get native Snowflake performance and scale without having to manage shuffling of data between compute platforms or jumping between languages to work SQL into Python-based applications. We now have a native API that lets you work in a familiar syntactic way while achieving the scalability and performance necessary when operating on data inside Snowflake.
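As a minimal sketch of what that looks like (assuming an existing `session` and a TRIPS table; the column names are illustrative, not the lab's exact code):

```python
import snowflake.snowpark.functions as F

trips = session.table("TRIPS")

# Nothing runs yet -- the chained operations are built up lazily as a SQL query.
busiest = (trips
           .filter(F.col("START_STATION_ID").is_not_null())
           .group_by("START_STATION_ID")
           .count()                          # yields a COUNT column
           .sort(F.col("COUNT").desc()))

# Only an action like show()/collect()/to_pandas() triggers execution,
# and 100% of the compute happens on a Snowflake virtual warehouse.
busiest.show(10)
```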
Python functions and the server-side runtime let you bring custom Python code to Snowflake through a couple of different constructs. Fundamentally, the way this works is that code developed in some other IDE, or even inside Snowsight itself, gets serialized (ultimately pickled) and pushed down into Snowflake to run in a secure, sandboxed Python runtime across virtual warehouse compute nodes. As part of that runtime, you can seamlessly access hundreds of third-party packages through our integration with Anaconda, including all the most popular packages and frameworks for your machine learning and data science needs. This lets you deploy functionality to run inside Snowflake, at Snowflake scale, on data in Snowflake, without a significant engineering lift to incorporate something a data scientist has put together into some external production orchestration architecture and framework. You just push it down to operate on data right inside Snowflake, where that data lives, using the same Python programming constructs you're already used to. Really simple, but extremely powerful. To re-emphasize one more time, our integration with Anaconda and our partners over there makes hundreds, really a thousand-plus, packages available to you both for local development and for production-level runtime execution inside the Snowpark environment. As we get going with the lab today, we'll actually use conda virtual environments to manage the compute environment we work in locally before deploying into the Snowpark server-side runtime.

One other partner I'd like to call out that we'll be using in the lab today is Astronomer, who provide managed Apache Airflow in the Astro orchestration cloud to joint Snowflake and Astronomer customers. In particular, we'll orchestrate our full end-to-end workflow using the Astro development tools and Astro CLI as we get toward the end of the lab.

All right, let's go ahead and get started. You can use the link shown on the screen to access the quickstart for this lab. If you want to follow along as we go, you're more than welcome to; the quickstart has links to all the source code and walks through the various sections of the lab step by step. There are also folks in the live chat to answer questions as we go, so if you get stuck, we have resources on hand to help out. And if you fall behind as we go through the lab today, don't worry about it: the quickstart is there and available, and you're always welcome to revisit it once we wrap up. Of course, as we go, feel free to ask questions in the live chat and we'll help get you sorted out.

What you're seeing here is that quickstart page I mentioned. As you can see on the left-hand side, it goes through all the various modules, walks you through the core Snowflake features we're using along the way, and can be used to follow along. It also includes a link to the source code on GitHub; you can click on that and explore, and it has all the various Jupyter notebooks we'll need. We will of course set up an environment using all those notebooks, but that link is included in the quickstart as well.
Ultimately, what we're going to learn and walk through today is: setting up an ELT pipeline using Snowpark Python to bulk ingest hundreds of millions of time series records, along with an incremental, operational loading process, all using the Snowpark Python client API; using that same API to perform exploratory data analysis, model development and experimentation, and feature engineering, together with the PyTorch TabNet framework for our model training; then creating a reproducible, monitored, and explainable automated pipeline for ML training and inference at scale, orchestrating that pipeline with Apache Airflow, and exposing the results through a Streamlit application. We'll be demonstrating with Airflow, but what we're showing today can certainly be done with any other Python-capable orchestration framework, whether that's Metaflow or some other popular tool.

What you'll need to get started is a Snowflake account with the Anaconda integration enabled. If you don't already have a Snowflake account set up for this lab, you can use the link here to register for a free trial account, which has everything we need; the only thing you'll need to do is go into the billing section of that trial account and accept the Anaconda integration terms. You'll also need some sort of Jupyter notebook instance and conda environment. I'll be showing the lab today from an Amazon SageMaker Studio Lab account, which is a very handy, lightweight JupyterLab instance our partners at AWS provide. If you want to do this lab at a future date, you can request a free Studio Lab account and run it there, but you can also just run a local Jupyter notebook instance using Miniconda. If you want to run the Airflow components on your own, you'll also need Docker Desktop; it's not required that you run them today, and we'll show you what all the Airflow components look like, but if you do want to run them you'll need Docker Desktop as well.

Here you'll see a preview of what we're actually going to end up building: a visual of the Airflow DAG we'll build for our end-to-end work. Ultimately, this is going to incrementally ingest new data to perform a monthly forecasting operation using Snowflake, Snowpark Python, PyTorch, and Airflow to orchestrate the workflow. One thing I want to highlight is that to do this forecasting, we're actually going to train hundreds of independent PyTorch models in parallel to generate new forecasts from updated models, all inside of Snowflake. We'll be doing massively parallel training and inference of deep learning models, all inside the Snowflake runtime, on hundreds of millions of data records, in a very performant and scalable way. We'll also calculate and store model evaluation metrics and then flatten our output to be consumed and displayed through a front-end Streamlit web application like the one you're seeing here.
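As a rough, hypothetical sketch of the shape of that monthly DAG (task names and bodies are illustrative, not the lab's actual code; each task would open a Snowpark session and push its work down to Snowflake):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@monthly", start_date=datetime(2020, 1, 1), catchup=False)
def citibike_monthly_forecast():

    @task()
    def incremental_elt() -> str:
        # load the newly arrived month of trip files into the TRIPS table
        return "TRIPS"

    @task()
    def generate_features(trips_table: str) -> str:
        # lag features, holiday and weather joins -- all Snowpark pushdown
        return "FEATURES"

    @task()
    def train_and_predict(feature_table: str) -> str:
        # calls the station-level train/predict UDF across all stations in parallel
        return "FORECASTS"

    @task()
    def evaluate_and_flatten(forecast_table: str) -> None:
        # store evaluation metrics and flatten output for the Streamlit app
        pass

    evaluate_and_flatten(train_and_predict(generate_features(incremental_elt())))

citibike_monthly_forecast()
```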
The use case we're tackling is built on the Citi Bike dataset. If you've seen a Snowflake demo over the years, you've likely seen this dataset before. It's a dataset of trips from a bike sharing network in New York City: Citi Bike customers choose from thousands of bicycles across many, many stations around New York City, pay to unlock a bike, ride it, drop it off at another station, and that counts as a trip. The use case is that Citi Bike wants to predict how many bicycles they're going to need at each station, and how many trips they can expect customers to take from the various stations around New York City. You can imagine a maintenance team wanting to make sure there are enough bicycles at each station, and that bikes that aren't working at high-demand stations get repaired properly, and all that good stuff; they want to relocate bikes based on predicted demand. Ultimately, the ops team wants an application that shows them how many bikes will be needed at a given station on any given day. We'll build that capability in this Jupyter environment, but we'll export the code we build along the way in a modular fashion so it can be used in an end-to-end operational deployment orchestrated with Airflow. What we really want to highlight there is the ability to use the IDE of your choice: data scientists like working in notebook environments, but a data engineer or ML engineer might work in some other IDE to build out these capabilities, and we'll show how Snowpark Python lets you make that choice yourself, rapidly iterate, explore, and experiment, and still produce production-ready, deployable Python code.

All right, moving on, I want to quickly highlight some of the setup we'll do up front. I'm not going to walk through this code in extreme detail, for the sake of time; it should be pretty straightforward. Ultimately, we're going to use a Snowpark Python session to create the various roles, warehouses, etc. that you'll need in your Snowflake demo account for the rest of the lab. Some of the things this highlights are Snowflake's RBAC model, our separation of compute and storage, and the virtual warehouse concept. The other thing I'd highlight is to make sure you check out the code repo's README and follow the setup instructions for your local development environment, whether that's Studio Lab, local Jupyter, or something else. The main thing is that we have an environment .yml file in the code repo, so once you clone it, whether to Studio Lab or some other environment, you can use that YAML file to set up a virtual environment for running all the notebooks we need. The README has the setup steps to get your local development environment ready: it includes the steps to enable the Anaconda integration in your Snowflake account, up at the top, and also instructions for setting up a Jupyter environment and an appropriate kernel in SageMaker Studio Lab. If you're not using Studio Lab, you can clone the repository and use the same Jupyter environment .yml file that's included to create a local Jupyter instance with the appropriate versions of Python and all the relevant packages; the README just has the instructions for doing that easily in Studio Lab, but of course you can use Miniconda, conda, pip virtual environments, or anything else to do the same thing.

So, to get started: in the 00 setup notebook, we're going to first establish all the various warehouses, databases, schemas, roles, etc. that we'll need throughout the rest of the lab.
The first thing we want to do is set up the user and account parameters in this dictionary, which enable us to connect to Snowflake with a Snowpark session. You'll update that with the username and account you created with the free trial, and then we save that dictionary as a JSON file, which we'll use throughout the remainder of the lab to keep connecting to Snowflake using Snowpark. This dictionary and JSON file get updated as we go with some new fields, but this is what we'll use to manage connectivity to Snowflake from our Jupyter environment. Next, we set up our initial Snowpark session to connect to and interact with Snowflake. To do that, we import the Snowpark library (aliased as sp) along with some other things we need; we'll have you enter the password for the ACCOUNTADMIN user listed above, and then establish our first session. Once you run this cell and enter that password, you'll be connected to Snowflake and able to start creating virtual warehouses, tables, schemas, and so on.

To help speed things up, we've got an S3 bucket in AWS where the data for this lab is pre-staged, so go ahead and run these cells to set the download URL for that underlying data so we can ingest it into our Snowflake account. Below that, we set up the actual lab demo user; imagine this is a particular user on your team. When you run the cell, we create that user, and the next cell asks you to enter a new password for the demo user. You can always use that same jack user and password you create here to log into Snowsight for your Snowflake account and explore there, but we're just going to set these things up from our Jupyter instance for Snowpark usage. The next cell goes through the warehouses listed in the JSON above and creates warehouses with those names in various sizes. All these warehouses are set to auto-suspend and auto-resume, so if we leave a particular warehouse idle for some time it just turns off and doesn't incur cost; of course, if you're using a free trial account it doesn't matter anyway. We'll also be using the Snowflake Marketplace later in the lab, so we grant the ability to import shares via the Marketplace to the role we created. We update the JSON file, save it out, and the last thing we do is create a wrapper function that uses that JSON file to establish a Snowpark session. That snowpark_connect function gets written to a .py file in the Jupyter environment, and we'll import and use it as a little helper throughout the rest of the lab to establish our Snowpark sessions. Go ahead and import that function and test it to make sure you've got access to the various compute resources. All of this should run successfully, setting up and building all the warehouses you need, and then we'll close this particular session and move into the data engineering lab, where we'll create the databases, schemas, and tables we need for our data.
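A minimal sketch of what a helper like that can look like, assuming the credentials JSON created above (the file name, keys, and function body here are illustrative rather than the lab's exact code):

```python
import json
from snowflake.snowpark import Session

def snowpark_connect(creds_file: str = "creds.json") -> Session:
    # the JSON holds account, user, password, role, warehouse, database, schema, ...
    with open(creds_file) as f:
        params = json.load(f)
    return Session.builder.configs(params).create()

session = snowpark_connect()
print(session.get_current_warehouse())   # quick sanity check of the connection
```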
All right, so the first thing we're going to do is our core data engineering tasks: get data into Snowflake and make it accessible to our data scientists. We'll extract our Citi Bike data from S3 into Snowflake, load the raw data into Snowflake tables, create other tables with transformed data our data scientists can actually use, and then export the relevant code we develop for these tasks as reusable Python modules and functions that our ML Ops team will use in their orchestration platform later down the road. The key Snowpark features we'll leverage are the client DataFrame API for our ELT-type tasks, and we'll also highlight Snowflake's core ingest capabilities and performance: we'll show how we can load close to 100 million records in about 30 seconds as part of our data engineering work.

So we'll open up the 01 data engineering notebook in our Jupyter environment. This is where all of these ML use cases start: how do we acquire, transform, and clean up the data we need to build the analyses and models behind our predictive capability? We've got data going back to 2013, new data comes in monthly, and it's available via S3. We're going to produce a trips table that includes the raw records: this customer took a trip from this station to that station, it cost X dollars, they had the bike for so long, and so on. That trips table will serve as the foundation for our forecasting model later on. To get started, we import the relevant packages and modules we need, as well as the Snowpark client API, and we use the same snowpark_connect function we built in setup to establish a session for a data engineer to interact with our Snowflake instance. Next, we create some stages for loading the data into Snowflake; if you've worked with stages before, you know they're the landing zone for raw files inside Snowflake before data gets extracted and put into tables. We also create the relevant databases, schemas, and stages we need for this task. As you go through this you'll see a couple of things happening: the file name formats change over time, and the schema of the underlying data changes as well. All we're doing here is taking files from that S3 bucket, building the collection of file names we want to extract and load into Snowflake, and figuring out which schemas go with which files. You'll see we've got a couple of sample data files we'll use to start testing out our loading, and then we'll bulk ingest a whole bunch of historical data once it's all worked out. We've got our feature engineering warehouse on the Snowflake side to do this work. You can see that we have files from two different schemas: pre-2017 we had one schema, and starting in 2017 another schema was introduced in the data, so we'll have to treat files from those schemas differently. The first thing we're going to do is download, unzip, and then load those files to a stage.
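A hedged sketch of what staging those files looks like with Snowpark (the stage and file names here are illustrative):

```python
# create an internal stage as the landing zone for the raw files
session.sql("CREATE STAGE IF NOT EXISTS LOAD_STAGE").collect()

# PUT a downloaded/unzipped CSV into a schema-specific subdirectory of the stage
session.file.put("citibike_trips_2017_01.csv",     # hypothetical local file
                 "@LOAD_STAGE/schema2/",
                 auto_compress=True,
                 overwrite=True)

# confirm the staged files are where we expect them
session.sql("LIST @LOAD_STAGE").show()
```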
When we run the cell here, that's exactly what we do: take the data from S3, unzip it, and load it to the stage. You'll see that within our load stage we have schema 1 and schema 2; the files for each schema get put into the corresponding subdirectory of that stage, and then we load the data from those schemas into tables. Once you've done that, you can run a standard LIST command on the stage to see the actual files located there; it matches what we're expecting, so it looks good.

So how do we actually load that data into a Snowflake table? First, we use the Snowpark API to create table schemas, which we'll then use to ingest the data. As I mentioned, we had a schema change at one point, so we're initializing two different schemas here. We're going to load all the raw data as raw strings, then go back in our transformed dataset and cast to the relevant data types we need: lat/longs, start times and stop times as datetime-type objects, and so on. The StructField and data type classes we're using from the Snowpark API map very closely to SQL data types; they're effectively the Python representation of the corresponding SQL type that will land in the table. Once we've got those schemas, we also create a common transformed schema describing how we'll combine the schema 1 and schema 2 data into a single trips table, which is what we do next.

This is where we really start using the DataFrame API. The first thing we do is create empty tables for our raw schema 1 and schema 2 files. To do that, we use the session.create_dataframe command, inputting essentially an empty set of data with the schema reference we just built above; we drop any null records that get inserted as part of that create_dataframe call, and then save that DataFrame as a permanent table under the name schema 1 in our Snowflake account. Then we do the exact same thing for schema 2. You could obviously use a CREATE OR REPLACE TABLE SQL command to do this; create_dataframe effectively does the same thing, establishing empty tables with a particular schema, but without writing any SQL, just these familiar Python commands. Next, we use our session object to read the files in our Snowflake stage with a number of options; again, if you've used Snowflake SQL before, this will look pretty familiar. We load those CSVs into the table we just created using a copy_into_table command, so we're effectively reading the files and copying the data from the stage into the Snowflake table we created above, and we do that for both schema 1 and schema 2.
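Here's a minimal sketch of that pattern, assuming an existing `session` and the staged files from the previous step; the schema is trimmed down to a few illustrative columns, whereas the lab's real schemas have many more:

```python
from snowflake.snowpark.types import StructType, StructField, StringType

raw_schema2 = StructType([
    StructField("STARTTIME", StringType()),
    StructField("STOPTIME", StringType()),
    StructField("START_STATION_ID", StringType()),
    StructField("END_STATION_ID", StringType()),
])

# create_dataframe over a row of NULLs + na.drop + save_as_table ~= CREATE OR REPLACE TABLE
session.create_dataframe([[None] * len(raw_schema2.fields)], schema=raw_schema2) \
       .na.drop(how="all") \
       .write.mode("overwrite").save_as_table("RAW_SCHEMA2")

# read the staged CSVs with file-format options and COPY INTO the table, all server-side
session.read \
       .schema(raw_schema2) \
       .options({"FIELD_OPTIONALLY_ENCLOSED_BY": '"', "SKIP_HEADER": 1}) \
       .csv("@LOAD_STAGE/schema2/") \
       .copy_into_table("RAW_SCHEMA2")
```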
And once we have that raw data loaded, we can start transforming and combining it. Typically, the way you'd do this in Snowflake before Snowpark was to write a bunch of SQL, which is all well and good for folks who love and are very adept at SQL, but many data scientists and data engineers work in Python environments, and now you have the ability to do that same kind of work on data using the Python APIs. One common data engineering task is cleaning up datetime formats in the data: we've got three different formats for datetimes, and we want to standardize them to a single format, and we can use the Snowpark API and some regex functions to do that. Here you'll see we take our Snowpark DataFrame and do a regex replace on start time using the date-format-matching regex expressions just above. With these with_column replacements, we're effectively taking the data in our tables, creating new start time columns via a regex replacement to get a standardized timestamp format, and saving the result as our trips table. So we're doing to_timestamp conversions and regex replacements on start time and stop time, and then selecting out the other common fields from our two schemas that we want in the trips table. Once you've done that, you can see the trips table has a schema with start time as a timestamp type, stop time as a timestamp type, and a number of string-type fields, and you can see we've got about two million records from the sample files we loaded.

Now we're going to export the code we developed above in the notebook as functions that can be used for ML Ops and orchestration, and then bulk ingest data. In the export code block, all we're doing is writing to another .py file the same code we built above: creating the schemas, creating a conformed schema for the transformed trips table, and the same functions we built to extract the trips to a stage, load data into stages, load the data from the stage into raw tables, and create the transformed, cleaned-up tables. We're just exporting all the code we already built in the notebook to a separate .py file. The only thing we add on top of that data engineering functionality is some bulk ELT work to load the full set of historical data from the S3 bucket we've staged for this lab. I'm not going to walk through these cells in tremendous detail; we're just using the functions we built in the previous notebook to test that they work and that you can load data from an S3 bucket into Snowflake, the same things we just walked through. But if I go down toward the bottom, you'll see two functions we create. The first is an incremental ELT function, which uses that exact same logic of extracting our trips data to stages, loading the raw data into tables, and loading a conformed dataset, with the idea being that the ML Ops team will use this incremental ELT function in their monthly pipeline.
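As a minimal sketch of the timestamp standardization step described above (the regex pattern here is a simplified placeholder for the lab's actual patterns, and the column and table names are illustrative):

```python
import snowflake.snowpark.functions as F

date_match = "^([0-9]?[0-9])/([0-9]?[0-9])/([0-9]{4}) (.*)$"   # e.g. 1/21/2015 7:14:05
date_repl  = "\\3-\\1-\\2 \\4"                                  # -> 2015-1-21 7:14:05

transformed = (session.table("RAW_SCHEMA2")
    .with_column("STARTTIME",
        F.to_timestamp(F.regexp_replace(F.col("STARTTIME"), date_match, date_repl)))
    .with_column("STOPTIME",
        F.to_timestamp(F.regexp_replace(F.col("STOPTIME"), date_match, date_repl)))
    .select("STARTTIME", "STOPTIME", "START_STATION_ID", "END_STATION_ID"))

# append the cleaned-up records to the conformed trips table
transformed.write.mode("append").save_as_table("TRIPS")
```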
Each month, when we get new data, we'll run that incremental ELT job on the new data to load it into our Snowflake account; you can see it's just using the same functions we already built in the previous section. The bulk ELT function we build here is a helper for the sake of the lab, to bulk load the nine-plus years of data in that S3 bucket into your Snowflake account for the rest of the lab. You'll run this cell to create these functions, and then you want to scroll almost all the way to the bottom, to where we run the bulk ELT function. The cells before that are just some examples of timing, like how long it takes to load multiple files; you can run those if you want to explore. But when you get here, you'll run this cell to reset our database, clearing out the data we've already loaded, and then bulk load all of the actual historical records from S3. You should see that the bulk ELT job takes about 30 seconds to run; it might be a little less or a little more depending on the network and so on, but it'll be something around there. Then you'll have the trips table back, and if you use the Snowpark session to do a record count on that Snowpark DataFrame, you'll see we've got just shy of 92 million records in that table. So: about 30 seconds, using the Snowpark API and Snowflake's ingest scalability, to load close to 100 million records, and these records are what we'll use to build our features and perform our modeling work later in the lab.

All right, once we've done the bulk ingest into Snowflake, we're also going to incorporate some Marketplace data. For this particular use case, we'll use the Snowflake Marketplace to subscribe to weather and climatology data from our partners at Weather Source. The idea is that people might be less likely to ride bikes when it's extremely cold or rainy or snowy. Maybe not: New Yorkers are tougher than some of the rest of us, so maybe they get on the bikes either way, but it's a fair hypothesis that when it gets cold and rainy, people ride fewer bikes, and we want to incorporate that data as a feature in our models. Traditionally, you'd have had to set up a whole other ELT pipeline to bring in that data from some external source, but with the Snowflake Marketplace you can just subscribe to the provider's data and it's directly available inside your account. If you want, open up the 02 data marketplace notebook and walk through the instructions: log into the Snowflake UI for your demo account using the jack user and password we created in the 00 setup notebook, then click on Marketplace on the left-hand side of the Snowsight UI. When you click on that, you'll see all kinds of listings available in the Marketplace; feel free to take a minute to poke around and explore. There's financial reporting and trading statistics, COVID-19 data, geospatial data, all kinds of really interesting stuff. For the lab, you'll want to search for "snowpark" in the search bar at the top.
A Weather Source tile will come up, named Snowpark for Python Hands-on Lab weather data; go ahead and click on it. It has weather and climatology data for New York City, as I mentioned, covering the same historical period as our Citi Bike data. It's updated daily and includes things like precipitation, temperature, wind speed and direction, and humidity. We're going to subscribe to that data by clicking Get Data on the right-hand side: enter WEATHER_NYC for the database name, select PUBLIC for the role access if it's not already there, and click Get Data. What that does is create a data share from Weather Source into your Snowflake account, so that as they continue to update their datasets, you get basically live, real-time updates in your account too, without having to manage any ELT pipelines to bring the data in; you just get it. This is a free dataset; there are also lots of paid datasets on the Marketplace that do have a cost associated with them, but in this case you get free access to the data, you don't have to maintain it, and it's readily available for any of your modeling or analysis needs. Again, it's just one of many, many examples in the Marketplace that you might want to use for your data. There are also a couple of cells at the bottom here that we'll run to programmatically capture the share ID of that Weather Source data, so that we could programmatically subscribe to it if we ever needed to rather than only using the Snowflake UI. Go ahead and run through the instructions in the 02 data marketplace notebook and then run those few cells at the bottom, and at that point you should have all of your historical Citi Bike trip data loaded into a formatted table and the Marketplace listing available in your Snowflake account under the WEATHER_NYC database name.

Once we have that, we're ready to jump into the real nitty-gritty data science work. As we get into the data science tasks, there are a few things we want to do: query the data in Snowflake using Snowpark Python, creating the daily-level aggregations of trip counts we'll need as our target and features; visualize that data; train a preliminary PyTorch TabNet model to predict our daily trip counts; and then augment that initial feature set from the trips data with some holiday reference data and the weather data we just pulled in from the Snowflake Marketplace. We'll then consolidate those feature engineering and model training functions into Python modules that the ML engineering team can use to operationalize the capability and our models. The highlights from the Snowflake side are the client DataFrame API to query and transform the data with 100% pushdown of the work into Snowflake: we're not going to extract data into our Jupyter environment except in some very specific circumstances, and even when we do pull data into local memory, it will only be highly aggregated or sampled data where all the underlying transformation to create those aggregations was performed on Snowflake compute. And of course we've got the Marketplace to enrich our feature set with the weather data.
Over in your Jupyter environment, open up the 02_data_science notebook. We also have a second ARIMA baseline notebook you could take a look at; it's more just for interest, if you want to compare PyTorch models against an ARIMA time series model, but we'll skip over it in the lab. It's another resource for you to check out. In the data science notebook, what we're going to do is some feature preparation and exploratory analysis. The original source data is almost 100 million rows, which is likely too large to fit in memory on our local machine, and certainly too big to fit in memory on many reasonably sized VMs, even in the cloud. Snowpark Python lets us aggregate and transform that data to a more appropriate level of detail, taking on the burden of all the computation for data prep and feature engineering without us having to extract hundreds of millions of records into some other environment. We will read data into memory for exploratory model training, but we're not going to store any data products outside of Snowflake.

To get started with our data science work, we import our snowpark_connect function and establish a new session to Snowflake, and import the other packages we need, like pandas and PyTorch TabNet. With the Snowpark API we can connect to our raw trips table and pull things like the min and max start times, to see that we have data going all the way from June 2013 through the end of 2019; imagine we've rolled back in time and we're doing this as though it's the beginning of 2020. You also see we've got 92 million records, which is just too much to fit in memory on our local system. One thing we might want to do is look at the daily trip count and just the first few rows of it. The DataFrame API lets us use DataFrame-style syntax and functions to do group-bys, record counts, sorting, and so on, and we can pull a daily count of trip records from those 92 million records into a summarized table without extracting or pulling any data onto our local machine. When I run this query and show the results, one hundred percent of the compute for that calculation happens in the Snowflake compute environment, on our virtual warehouses; we're not running anything in our local Jupyter environment other than issuing the command, and all the underlying data and transformation stays inside Snowflake instead of our local client. That daily aggregate data is small enough that we can fit it in memory and visualize it, so we pull it into our local environment with a to_pandas() call and plot it with matplotlib, just so we can see what our daily pattern of trip counts looks like. As you can see, there's some clear periodicity to the data: fewer trips seem to be taken in the winter, and there's a general increasing trend, with more trips taken over time. With that knowledge, we might not want a single model that predicts granular trip counts for individual stations, so we'll start by experimenting with the busiest stations, and we can use Snowpark to identify which stations have the most trips associated with them.
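A minimal sketch of that daily aggregation and the to_pandas() pull described above (column names are illustrative; the heavy lifting stays in Snowflake and only the small aggregate comes back to the client):

```python
import snowflake.snowpark.functions as F

daily_counts = (session.table("TRIPS")
    .with_column("DATE", F.to_date(F.col("STARTTIME")))
    .group_by("DATE")
    .count()              # trips per day, computed entirely on the warehouse
    .sort("DATE"))

# only a few thousand aggregated rows come back, so this is safe to pull locally
pdf = daily_counts.to_pandas()
pdf.plot(x="DATE", y="COUNT")
```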
Here we see that station ID 519 is our busiest, with over 870,000 trips in the dataset. Not surprisingly, that ID corresponds to Central Park and 6th Ave, a very popular area in New York City. We're going to use that as the basis of our experimentation, and then later we'll horizontally parallelize the model training and inference for every station simultaneously. For now, we filter our data to pull back only the data corresponding to that top station, station ID 519, pull it back into our client, and visualize just that. You see from the plot that we have a very similar pattern: there's clear seasonality and some clear upward growth in general. As a result of that periodicity, we may want to use lag features as inputs to our regression model; the idea is that the number of trips taken one year ago might be indicative of the number of trips we should expect today. We use the autocorrelations in this data to confirm that guess, and when we plot them, some pretty clear seasonality becomes apparent: a strong positive autocorrelation for very recent records, as well as some meaningful positive autocorrelation out around 365 records. We see the same seasonality with a negative correlation in the six-month range, where, as you'd expect, summer may not tell us much about how many trips to expect in winter. Looking more closely at daily, weekly, and monthly lags, we see that the 1-, 7-, 90-, and 365-day lag variables do carry some pretty meaningful autocorrelation, so that's what we'll use as the feature set. Of course you could include additional feature calculations in a model and do a lot more exploration and experimentation, but for our purposes that's what we'll pull in and use.

To perform those 1-, 7-, 90-, and 365-day lag calculations on our very large dataset, we use some of the built-in functions the Snowpark API provides, in particular the lag window function, to generate those lag features very quickly. To do that, we take the initial raw data, select our particular station ID, group by station ID to get our trip counts, and then do a with_column with a lag on the count, offset by 1, 7, 90, and 365 days. We also populate default values: to compute a 365-day lag you need a full year's worth of history, so rather than throwing out that initial year of data where we don't have a 365-day lag, we populate it with a default value equal to the average over that initial one-year period. When we run that generate-features function on just our top station, we get, by day for this station, the number of trips taken plus the 1-, 7-, 90-, and 365-day lags of those trip counts. So on day two, 46 trips are taken and the 1-day lag shows yesterday's 33; the 7-day lag just uses the average because we don't yet have seven days of history, but once we get to day eight, the 7-day lag for day eight corresponds to the day-one count, and so on. The same is true for 90 and 365.
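A hedged sketch of that lag-feature step, continuing from the daily_counts DataFrame in the previous sketch (the notebook does this per station; for brevity this fills the default with the overall mean rather than the first-year average the lab uses, and the names are illustrative):

```python
import snowflake.snowpark.functions as F
from snowflake.snowpark import Window

# overall mean trip count, used as the default where the lag has no history yet
mean_count = daily_counts.agg(F.avg("COUNT")).collect()[0][0]

w = Window.order_by("DATE")
features = daily_counts
for offset in (1, 7, 90, 365):
    features = features.with_column(
        f"LAG_{offset}",
        F.lag(F.col("COUNT"), offset, default_value=F.lit(mean_count)).over(w))

features.sort("DATE").show(10)
```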
These are going to be the initial features we use in our model training and inference pipeline, and we'll use the functionality we've built here to parallelize that across all the stations and automate it into an orchestrated pipeline in a minute. As I mentioned before, you could then do something like a simple ARIMA baseline model to see what kind of predictions you get; that's in a different notebook and we'll skip over it for now. Instead, we'll use the PyTorch TabNet model framework to develop an initial model to predict our trip count. Our function essentially takes in our feature data (those lags we just produced), our target trip count, and a cutpoint, that is, how much of the data should be used for training, testing, and validation. We split the data at that cutpoint and then fit our TabNet regression model against the training data passed into the function. This is all pretty standard ML train/test-split and model-training code; nothing Snowpark-specific is happening here, just a standard train/test split, model fitting, and so on. Where this gets interesting is in the ML engineering task, when we have to productionalize it and do it for hundreds of stations simultaneously. Similarly, our predict and forecast functions should look very familiar: you call model.predict on some input to get your predictions, you produce your forecast into the future, and we've got some plot functions to help with that. Our train-and-predict function takes the functions we just built and calls them explicitly on an input dataset, using our feature columns and so on, and then calculates error metrics like our RMSE. If we run that against our initial set of data, we get an RMSE of 17,391. We can also look, by day for this station, at what the actual trip count was versus the predicted value, to get a sense of how the model is or isn't performing, and if we plot predictions versus actuals (the blue is the actuals, the orange is the predictions), we seem to be capturing the pattern but not the peaks and valleys quite yet. Through the remainder of this notebook, as you keep walking through it, we add a holiday dataset using the Snowpark DataFrame API as well: the thinking is that on Christmas there are probably fewer people taking trips than on a non-holiday, while by the same token on the 4th of July you might actually see more trips taken. So we generate that holiday DataFrame and join it to our training data, all with the Snowpark DataFrame API: we join the holiday DataFrame to the training data on date, fill in the non-holidays with zero, and now we have holiday as a new feature we can use in our model. We call the same train-and-test function on that new DataFrame, and our error does in fact improve. Plotting it, though, we see the same kind of pattern, so we're still potentially missing some other features.
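For reference, here's a hedged sketch of what that station-level TabNet training looks like (hyperparameters, column names, and the time-based cutpoint are illustrative; it assumes the station's feature DataFrame built above has been pulled into pandas):

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

feature_cols = ["LAG_1", "LAG_7", "LAG_90", "LAG_365", "HOLIDAY", "PRECIP", "TEMP"]
pdf = features.sort("DATE").to_pandas()        # one station's history fits in memory

cutpoint = int(len(pdf) * 0.8)                 # time-based split, no shuffling
X_train = pdf[feature_cols].values[:cutpoint].astype(np.float32)
y_train = pdf[["COUNT"]].values[:cutpoint].astype(np.float32)
X_valid = pdf[feature_cols].values[cutpoint:].astype(np.float32)
y_valid = pdf[["COUNT"]].values[cutpoint:].astype(np.float32)

model = TabNetRegressor()
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          eval_metric=["rmse"],
          max_epochs=100,
          patience=20)

preds = model.predict(X_valid)
```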
In addition to the features we're calculating, we're also going to incorporate the weather data we pulled in from the Snowflake Marketplace as another feature in our dataset. Again, the hypothesis is that when it's cold and rainy, people are less likely to ride bikes, so weather might be a good indicator of what our trip count will look like. As we keep going through the notebook, you'll see we also have a programmatic way to pull in the same data from the Snowflake Marketplace that we subscribed to via the UI previously, and then we create some functions to materialize the weather data into a table for the dates we need: daily precipitation and temperature, pulling in just total precipitation and average temperature using the Snowpark API. We check that we have weather data through the whole period covered by our trip data, and then we can use the API to do an inner join on the date fields. You see we then have, by day, the total trip count, the lag features we calculated, a holiday indicator, and the precipitation and average temperature. This comprises the full feature set for our model. We rerun the train-and-predict function we built previously, and at the end of the epochs we see our lowest RMSE so far. We can plot predicted versus actual just like before, and we're seeing improved performance across the whole period for this station.

One thing that's nice about TabNet in particular is that it exposes model feature importances, so we have some explainability as to why the model is making its predictions; this is also true of frameworks like XGBoost and a number of the scikit-learn implementations. We can extract the feature importances and see that the 7-day lag, precipitation, temperature, and the 90-day lag seem to be the most impactful, while the holiday flag and the 365-day and 1-day lags aren't contributing much. You might use this to eliminate some of your features; in our case, we'll use the feature importances for this station to keep only the 1- and 7-day lags in our output and drop the other lags. So we just update the generate-holiday, generate-weather, and generate-features functions we built earlier to calculate those lag variables and join up the data, and exporting this into another .py file lets us hand it off to an ML engineer who can implement it in a more operational pipeline. Lastly, we call our generate-features and model-training functions one more time just to see what the final model framework looks like, and there we have those results. So now we have reusable functions to construct the feature sets and do our model training and prediction into the future, and we have to hand this off to an ML engineer to perform that work at scale for all the stations we're interested in forecasting.
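A quick sketch of that explainability check (continuing from the fitted TabNet model in the previous sketch):

```python
import pandas as pd

# TabNet exposes per-feature importances after fitting
importances = (pd.Series(model.feature_importances_, index=feature_cols)
                 .sort_values(ascending=False))
print(importances)

# low-importance features are candidates to drop from the operational feature set
```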
So we're just going to update those generate-holiday-data-frame, generate-weather-data-frame, and generate-features functions that we built earlier to again calculate our lag variables and then join up the data. Exporting this into another .py file allows us to hand it off to an ML engineer, who can then implement it in a more operational pipeline. And lastly, we'll go ahead and call our generate-features and model-train functions just to see what our final model framework looks like, and there we have those results. So now we've got reusable functions to construct the feature sets and do our model training and prediction out into the future, and we have to hand this off to an ML engineer to perform that work at scale for all the stations that we're interested in forecasting for. As we move on to the next section, that's what we're going to be targeting. From an ML engineering perspective, what we're going to cover here is taking those functions to materialize the data we need as tables, incorporating some forecasted weather data for future-looking predictions, and creating a vectorized pipeline that allows us to train our models for all the stations in parallel using Snowpark and Snowflake's window functions and Python UDFs. The reason we're going to use a UDF here is because we're performing both training and inference for all our stations: since it's a time series prediction, every time we run this pipeline we're going to have updated data, so we're going to need to retrain our model and do our inference, and we're only going to be storing the predictions to a table as opposed to the model object itself. We're also going to evaluate our models in parallel using a Python stored procedure, and then set up these functions to be used in a fully orchestrated pipeline on Airflow. The real things I want to highlight here are the power of the Python UDF and the Snowpark API to let us scale this horizontally for all of the stations we're interested in. Typically, if you wanted to do this, you'd have to either stand up a lot of different compute infrastructure to run all these models in parallel, or you would just iterate through them one at a time. What the virtual compute that Snowflake provides, in combination with Snowpark, allows us to do is parallelized model training and forecasting for all our stations simultaneously. So we're talking about over seven years of data for hundreds of different stations, and doing that work all at once instead of in a very computationally intensive, serial way. To get started, we'll open up the 03 ML engineering notebook and go back to the top here (I think I started scrolling down). One thing that I want to mention is that what we're really going to be using in this notebook is the functions that our data scientist already wrote, so I'm not going to go into a ton of detail about the internals of those functions, but I will explain the UDF-specific pieces that allow us to do the parallelization in particular. First we're going to do our connection to Snowflake just like before and import some functions that we need, all that good stuff. Again, instead of calculating our holiday data every time we run the pipeline, we're just going to save it to a table persisted in Snowflake so we don't have to continue to rerun it every time, because the holidays obviously aren't changing. We'll make sure we're subscribed to the weather data, and then we're going to create a view on the weather data that is based off of our generate-weather-data-frame function. So that sequence of Snowpark API transformations is now being translated into a view inside of Snowflake: you just query the view directly and get the feature set back without having to know the intricacies of the actual transformations that are taking place.
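A minimal sketch of that pattern, turning a chain of Snowpark DataFrame transformations into a view, might look like the following; the source table, column, and view names are assumptions, not the lab's actual schema.

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

def create_weather_view(session: Session, start_date: str) -> None:
    weather = (session.table("WEATHER_SOURCE")            # e.g. the Marketplace share
                      .filter(F.col("DATE") >= F.lit(start_date))
                      .group_by("DATE")
                      .agg(F.sum("PRECIP").alias("TOTAL_PRECIP"),
                           F.avg("TEMP").alias("AVG_TEMP")))
    # The lazily defined DataFrame is persisted as a view; querying the view
    # re-runs these transformations inside Snowflake.
    weather.create_or_replace_view("WEATHER_VIEW")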
The next thing we're going to do is create our forecast data frame. This is what joins up our future holidays with future weather forecast data; again, nothing tremendously exciting happening here, just adding some time deltas to fill in future-looking precipitation and temperature data in our feature set so that we can predict future values from the current feature set. That's all serialized as a function, which allows us to then call it for some number of forecast steps moving forward. Here's an example of what that forecast data frame looks like: assuming we're in January 2020, we have the forward-looking predicted precipitation, temperature, and holiday indicator for that future time period. Now here comes the part about parallelizing this work across all the various stations. You saw before that we used window functions on the date to calculate our lag variables, but now we also want to do that for all of the unique stations, and then group all the lags and group-bys within each station. So instead of looking at a single top station, our window will actually include the station ID as a partitioning mechanism; here you'll see the partition by station ID, and then in our historical feature set calculation we have a group by station ID and date. What this means is that now, when we calculate our one-day lag in our feature set, it's not just yesterday's total count, it's yesterday's trip count for the particular station that I'm looking at right now, and this is happening in parallel for all the stations, whereas before we were just looking at one single station, the Central Park station ID. As we specify this window to be over both station and date, that allows us to calculate our lag variables appropriately, and then do our left joins on our holiday and weather data to incorporate that as our input feature set. We'll then just check really quickly to make sure that for all of our stations we've got data through the end of December and we have at least two years of data for every single station. The interesting thing to note here is that we actually have 516 different stations with two years of historical data, and so what we're going to do is basically be training and producing forecasts for 516 stations simultaneously.
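Here's a rough sketch of that per-station windowing with the Snowpark API; the table and column names are assumptions about the schema rather than the lab's exact code.

from snowflake.snowpark import Session, Window
from snowflake.snowpark import functions as F

def generate_features(session: Session):
    # Daily trip counts per station, with lags computed within each station.
    w = Window.partition_by("STATION_ID").order_by("DATE")
    counts = (session.table("TRIPS")
                     .group_by("STATION_ID", "DATE")
                     .agg(F.count("TRIP_ID").alias("COUNT")))
    return (counts.with_column("LAG_1", F.lag("COUNT", 1).over(w))
                  .with_column("LAG_7", F.lag("COUNT", 7).over(w))
                  .join(session.table("HOLIDAYS"), "DATE", "left")
                  .join(session.table("WEATHER_VIEW"), "DATE", "left")
                  .na.fill({"HOLIDAY": 0}))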
Here's the part where we get into creating our UDF for training and inference. Now that we've done the feature generation parallelized for all the stations, we're going to create the UDF to do training and inference across all the stations. This station-train-predict function should look very familiar: again, we're just taking our train and test data, doing our fits, creating our predictions, creating our forecast data frame, and so on, all the things we were already doing before. Imagine, for one station, what do you functionally need to do? You need to pull in the data, fit your model, produce your forecasts, and provide the forecast data frame out into the future. We also want to incorporate our model explainability, so we're going to include the feature importances output, but that's all we're really doing here; this isn't really any Snowpark-specific stuff, we're just training a model and doing our forecasts. The real trick comes when we then deploy this as a UDF and do an array aggregation on our station IDs to process them in parallel simultaneously. All right, so now that we're deploying this as a UDF, what that's going to allow us to do is kick off our training and predictions to run inside of Snowflake, all in parallel, for our various stations. You see up above where we have our station-train-predict function; that's what's actually going to get serialized as a UDF in Snowflake. To do that, all we do is create a stage where the function is going to get pushed to, and specify the package dependencies that our code has, so we have dependencies on pandas, PyTorch, SciPy, and some others here. This is also where we're including our custom dependency on TabNet: in particular, we have downloaded a zip folder of that PyTorch TabNet code, and we can just include the relative path in our local environment to that zip as an import dependency. Then when we register our UDF and pass in that list of imports, that'll actually package the TabNet code up for us and push it to the Snowflake runtime, so that we can use our TabNet regression model inside of our UDF even though it's not available through Anaconda directly. Again, there are a ton of packages, like all the other ones listed here, that are supported via the Anaconda integration, but in the event that you're using a library that isn't, you can still include it as a dependency on your UDFs, UDTFs, stored procedures, etc. We find this really handy for our Snowflake customers who have custom, homegrown, internally built packages and libraries that they use and share; you can just download a zip folder of that code and include it as an import dependency on your UDFs, and you don't have to duplicate functionality in many places, you can continue to use those custom homegrown libraries that your teams have already built. With the UDF registration, we give it a name, which is how it'll be invoked as part of a SQL or Snowpark API query, we specify the dependencies in terms of the import dependencies and package dependencies, we specify the input types that go into the function, in our case combinations of arrays, integers, strings, etc., and then how we're going to return data out of it. When we run this, it makes the station-train-predict function available as a UDF to execute inside of the Snowflake virtual warehouse compute infrastructure, as opposed to in any other separate compute environment. When we invoke the UDF to call all the station IDs in parallel, we have to do a vectorization where all of the data for a particular station gets passed to a single instance of the UDF, by aggregating all the features for each station. To do that, all we're doing is passing in an array construct of our feature column names, the names of our historical columns that we're interested in; we call an array agg and an array construct on that list of column names and then alias it as historical data. This is essentially the input features to our model training function. What that then allows us to do is call our UDF on the historical data: here you see we've got our historical data frame, we're going to join it with our forecast data frame, and then just select our UDF function call passed the historical data. This is how we're invoking the UDF to now run on some of our stations. Again, this is where we specify the lag values we're using, the target column, the lag values array, all that kind of stuff, and we're going to save the predictions that are produced by this model training job into a pred test table. You can see here we're running 516 models in parallel in just about 45 seconds using this approach. Again, by vectorizing the input and parallelizing the station IDs, we're able to do highly scalable training and forecasting in a very limited amount of time.
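To make that pattern concrete, here is a hedged sketch of the registration and the array-aggregated call. The stage name, zip path, handler signature, argument types, and table names are all assumptions rather than the lab's exact definitions, and session, station_train_predict, feature_df, and feature_cols are presumed to already exist in the notebook.

from snowflake.snowpark import types as T
from snowflake.snowpark import functions as F

session.sql("CREATE STAGE IF NOT EXISTS ML_STAGE").collect()

session.udf.register(
    func=station_train_predict,               # the per-station handler described above
    name="STATION_TRAIN_PREDICT_UDF",
    stage_location="@ML_STAGE",
    packages=["pandas", "pytorch", "scipy"],  # resolved from the Anaconda channel
    imports=["./include/pytorch_tabnet.zip"], # custom library not in Anaconda
    input_types=[T.ArrayType(), T.ArrayType(), T.StringType(), T.IntegerType()],
    return_type=T.VariantType(),
    is_permanent=True,
    replace=True,
)

# Vectorize: aggregate each station's rows into one array so a station's
# full history lands in a single UDF call.
historical = (feature_df
              .group_by("STATION_ID")
              .agg(F.array_agg(F.array_construct(*[F.col(c) for c in feature_cols]))
                    .alias("HISTORICAL_DATA")))

preds = historical.select(
    "STATION_ID",
    F.call_udf("STATION_TRAIN_PREDICT_UDF",
               F.col("HISTORICAL_DATA"),
               F.array_construct(*[F.lit(c) for c in feature_cols]),
               F.lit("COUNT"),
               F.lit(30)).alias("FORECAST"))  # e.g. 30 forecast steps
preds.write.save_as_table("PRED_TEST", mode="overwrite")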
You can then also extract data from that pred test table to see how the model actually performs; you can plot what your counts and preds look like for a particular station or for a number of stations, and then do some model evaluation. We're going to use RexMex for model evaluation specifically, and much in the same way that we deployed the UDF to do per-station training and forecasting, we're just going to deploy another UDF that does per-station performance evaluation. Here is our function for a single station: what do we want to do? We've got our input data, we've got our actuals, and then we have the metrics that we're going to calculate with our RexMex scorecard. As you can see, we just generate a report on a data frame and return the output back; nothing too exciting here, just some basic error metric calculation. Then we deploy that as a UDF, much in the same way that we did with our training and prediction function: specify the dependencies, packages, imports, etc., give it a name, input types, all that good stuff, and we can then call that UDF from a Snowpark API call or a SQL query or anything like that to evaluate our model output across our 516 models. What you see here is that we're evaluating the performance of 516 models in just over six seconds. And then when we look at what our output actually looks like, we see that for this station, on this run date, we have this mean absolute error, this MSE, etc. It's stored in a variant data type, so you can use Snowflake's semi-structured processing and query capabilities to extract particular values that you might be interested in out of it. You could do the same work in a stored procedure, but as I mentioned, that's super inefficient because it doesn't allow for parallel execution when we're trying to do this for many, many stations simultaneously; on the other hand, the UDF implementation of our error metrics parallelizes out and scales across all the various stations. The other thing we can do, as I mentioned, is use some of that semi-structured data processing to flatten out specific things that we want from the prediction outputs and from our error outputs. Here we're just seeing some of the queries that you might use to extract various explanatory variables or features, for example, out of our predictions, and then similarly we can do that for our evaluations too.
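As an illustration, queries along these lines can pull fields out of the variant columns; the table names and JSON field names here are assumptions about how the outputs were stored, and session is the existing Snowpark session.

# Pull individual metrics out of the VARIANT evaluation output.
eval_metrics = session.sql("""
    SELECT STATION_ID,
           RUN_DATE,
           EVAL_OUTPUT:mean_absolute_error::FLOAT AS MAE,
           EVAL_OUTPUT:mean_squared_error::FLOAT  AS MSE
    FROM EVAL_TEST
""")

# Flatten an array of per-feature importances into one row per feature.
flat_importances = session.sql("""
    SELECT p.STATION_ID,
           f.value:feature::STRING    AS FEATURE,
           f.value:importance::FLOAT  AS IMPORTANCE
    FROM PRED_TEST p,
         LATERAL FLATTEN(input => p.FORECAST:feature_importances) f
""")
flat_importances.show()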
Now that we've done all this work to parallelize the execution of our pipeline using UDFs, we're just going to consolidate those functions into a single .py file, which will actually be used in our Airflow orchestration as our ML Ops team builds out the pipeline to run this on a regular basis. As you can see, these are all the same functions that we just went through and wrote, just written out to a .py file here. All right, so the ML Ops team is basically responsible for creating and deploying the pipelines that our data scientists and ML engineers have come up with in a reproducible way. If we hop over into the 04 ML Ops notebook, we're going to do the same kind of stuff where we've got our credentials set up to connect to Snowflake so we can start testing this out, and then all we're going to do is create a single function that brings in all of the other functions that our ML engineer already created and runs them sequentially. You can go through what the sequence of steps in this pipeline is and run them one by one just to ensure that each of them individually works: we go through creating our holiday table, subscribing to the weather data, creating the weather view, deploying the train-and-predict UDF, deploying the eval UDF, and creating the feature table and forecast table. These are all the things that we just went through in the ML engineering notebook, so I won't dig into too much of it here. Now what we're going to do is consolidate that whole sequence of functions into a set of tasks, which is what our Airflow orchestration pipeline is going to use. We have our database setup task, which basically just resets our database for an initial setup; we have our incremental ELT task, which brings in our monthly data; we have an initial bulk load task for bringing in all the historical data; and then here are the corresponding tasks for each of those various functions we just went through: materializing the holiday table, subscribing to the weather data, creating the views, deploying the UDFs, all that kind of stuff. Again, exactly the same as what we just did, the only difference is that it's now all materialized into these tasks, which Airflow is going to actually use and execute; all of them are just calling the same functions we already built under the hood. We're also doing some things like setting model ID tags on particular tables: when we run one version of the model, we use a Snowflake tag to tag it with a UUID so that, for explainability purposes, we know which model run a particular forecast came from; you'll see those alter table statements throughout here as well, and we have the same for our evaluation and flattening. The pipeline then is just a very simple sequence of those tasks. Here's the one-time setup, where we have the order of tasks specified for our bulk setup, and then we have a very similar pipeline that is the task flow for the incremental monthly job, where we incrementally update our data, generate the feature data, generate the forecast data, bulk train and predict, evaluate, and then flatten the output. This is what you'd actually run on a monthly basis; the first version is just an initial setup pipeline that, if you wanted to create this whole lab from scratch, you could run just once, and from there you would do your incremental loading. As we write these out to .py files, we can also test that they in fact work, so you can run those task flows from your Jupyter notebook if you wanted to, but what we're actually going to do instead is use the Astro CLI to launch the pipelines and view them as DAGs inside of Apache Airflow. If you want to walk through the cells at the bottom here, just to test that incrementally doing this monthly works, you can see that we're going from new data arriving to having full new forecasts for all of our various stations in basically two minutes each time we run one of these tasks, which is pretty impressive when we're talking about 515-plus stations that we have to produce predictions, forecasts, and error calculations for. Once we have these task flows and we spin up Apache Airflow with them, that's where you get a very nice, neat, compact way to view the underlying jobs that we're doing. So go ahead and zoom in here, and as you can see, here's where we have the underlying setup task: this is our first DAG, which goes through and sets up our initial database, subscribes to our weather data, creates our weather view, does our initial bulk loading of data, and deploys our UDFs, the full setup, all in a DAG kind of view. If we go back to our Airflow home screen, you can check out the monthly task flow as well; you can look at that graph and see where we have our incremental ELT, generate our feature and forecast tables, run our bulk train and predict, run our evaluation, and then flatten our output.
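As a rough sketch of what that monthly task flow can look like with Airflow's TaskFlow API (the task names, schedule, and bodies here are simplified assumptions; each task would call the corresponding Snowpark function written earlier):

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@monthly", start_date=datetime(2020, 1, 1), catchup=False)
def citibike_monthly_taskflow():

    @task()
    def incremental_elt() -> str:
        # load the latest month of trip data with Snowpark, return the run date
        return datetime.utcnow().strftime("%Y-%m-%d")

    @task()
    def generate_feature_and_forecast_tables(run_date: str) -> str:
        return run_date

    @task()
    def bulk_train_predict(run_date: str) -> str:
        # invokes the per-station train/predict UDF across all stations
        return run_date

    @task()
    def evaluate_and_flatten(run_date: str) -> str:
        return run_date

    evaluate_and_flatten(
        bulk_train_predict(
            generate_feature_and_forecast_tables(incremental_elt())))

citibike_monthly = citibike_monthly_taskflow()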
So you get all these task dependencies, you get all kinds of lineage and documentation about runs, failed runs, successful runs, skipped jobs, task durations, retries, all this kind of awesome information that Airflow provides from an orchestration perspective, all for these simple tasks that are actually executing on the Snowflake side. Everything built into each one of these tasks is comprised of either Snowpark API calls, Python UDFs, or something like that, which means that when I execute this DAG, and I'll trigger it to run as though we're doing a March 1st update, for example, you can see it's going to queue it up, we'll see the individual jobs running, and as these complete we'll eventually have additional output inside of our Snowflake account and all of the updated predictions will be made available to us. The other piece, as this runs, is that we can hop over to our Streamlit app. If you go into your Jupyter environment, the Streamlit app is available to you if you want to take a look at the underlying code that comprises it; ultimately it's just about 100 lines of code to pull some data into some interesting charts inside of a Streamlit app, so it's very simple to use and work through. But what that means is that we have, as part of our output from the models, a whole web application that provides the forecasted values to us. Here we have, from our last update on February 1st, seven days of predictions for some of the most impactful stations. We have the predicted value in white here, but we're also using those explanatory features from our model that we're saving to say, in this case in red, that for station 285 on this particular day we're expecting 214 trips, and the 365-day lag is the specific feature contributing most to that prediction. Then on February 6th there's presumably some kind of change in temperature, and that's causing the prediction to drop significantly, and so on and so forth. You can start to visualize these things in a very simple way and share these Streamlit applications out to other users. You can also consume things like the model job monitors, which show, for various stations, the evaluation and error metrics for different job runs. If we flip back to our Airflow instance and refresh it, we should see that it's almost done: our bulk train and predict task is completed, we're just running our evaluations right now and then flattening the output, and once we have the flattened output in our Snowflake account we'll be able to refresh our Streamlit app and see our most recent set of March results. So there we go, we're all green, success: we've run the updated data loading, generated new features and forecasts, run train and predict, run our evaluation, and flattened the results. If we go back to our Streamlit application, and I'm just going to refresh it now, you'll see that we have updated data as of February 29th from our most recent run; you can see we've got updated error metrics and new forecasts looking out into March for our various stations. This is now a very easy way that an end consumer could begin to use and consume this sort of work, and this pipeline can of course be triggered from Airflow, as we just did, but also scheduled to run on a regular interval or anything like that.
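For a sense of scale, a bare-bones version of such a Streamlit page might look like the following; the table name, column names, and secrets layout are assumptions (the lab's real app has more going on), and a recent Streamlit with st.cache_resource is assumed.

import streamlit as st
from snowflake.snowpark import Session

@st.cache_resource
def get_session() -> Session:
    # Credentials pulled from .streamlit/secrets.toml (hypothetical layout).
    return Session.builder.configs(st.secrets["snowflake"]).create()

session = get_session()
st.title("Citi Bike station trip forecasts")

stations = [r[0] for r in session.table("FLAT_FORECAST")
                                 .select("STATION_ID").distinct().collect()]
station = st.selectbox("Station", stations)

forecast = (session.table("FLAT_FORECAST")
                   .filter(f"STATION_ID = '{station}'")
                   .select("DATE", "PRED", "TOP_FEATURE")
                   .sort("DATE")
                   .to_pandas())

st.line_chart(forecast.set_index("DATE")["PRED"])
st.dataframe(forecast)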
So to summarize, we've gone through the full flow, from initial Snowflake account and database setup, initial data ingest and ELT processes, data science exploration and model training, and ML engineering to wrap it all up in a nice, neat way, through the orchestration of those jobs using Airflow and the Astronomer CLI. The end result is that we have a very tidy end-to-end ML pipeline that can be triggered or run on a regular interval, where a hundred percent of the compute associated with this particular pipeline is executed and performed inside of Snowflake, so there's no data leaving the Snowflake environment or anything like that. All of the train, predict, feature generation, etc. is using Snowpark Python components to run entirely inside of Snowflake, and we have a very nice, neat Streamlit application to visualize those results and share them with other users. Thanks for watching. If you're interested in more resources on Snowpark Python and want to spend more time with this lab, you can visit the Snowflake Quickstarts website, where you can find this lab along with many other quickstarts for getting started with Snowpark and doing machine learning and data science on top of Snowflake. If you like this video, please make sure to like it and subscribe to our YouTube channel so you can see other fun content like this in the future.