
Lecture Notes: End-to-End Machine Learning using Snowpark Python

Jun 13, 2024

Presenter

  • Name: Caleb Bechtold
  • Position: Machine Learning and AI Architect, Field CTO Office, Snowflake

Introduction

  • Objective: Build a scalable and secure ML workflow on Snowflake Data Cloud using Snowpark Python.
  • The entire workflow runs within Snowflake, with no data movement.
  • Pipelines are orchestrated as a DAG in Apache Airflow.
  • Results are visualized for end users with Streamlit and Snowpark.

Key Components

  • Snowpark: A framework that extends Snowflake to non-SQL workloads (Python, Java, Scala).
    • Client API: DataFrame-style transformations, with compute pushed down to Snowflake (see the session sketch after this list).
    • Server-side Runtime: Executes arbitrary Python code, UDFs, and UDTFs on Snowflake's compute infrastructure.
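
A minimal sketch of the client API, assuming placeholder connection parameters and a hypothetical TRIPS table: a Session object is created, and DataFrame operations are lazily compiled to SQL and pushed down to a Snowflake warehouse.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Connection parameters are placeholders -- fill in your own account details.
session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

# DataFrame transformations run on the warehouse; no data leaves Snowflake.
trips = session.table("TRIPS")                      # hypothetical table
daily_counts = (
    trips.filter(col("START_STATION_ID").is_not_null())
         .group_by(col("START_STATION_ID"))
         .count()
)
daily_counts.show()
```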

Workflow Overview

  1. Data Engineering: ETL tasks using Snowpark DataFrame API; loading data into Snowflake from S3.
  2. Feature Engineering: Create aggregate features from loaded data (e.g., lag features using window functions).
  3. Model Training: Use PyTorch TabNet for training ML models on historical data.
  4. Orchestration and Deployment: Automate with Apache Airflow, utilizing Astronomer's managed Airflow instance.
  5. Visualization: Expose model results through Streamlit applications.

Detailed Steps

Data Engineering

  • Goal: Ingest raw data, create structured tables for ML models.
  • Steps: Set up a Snowpark session, create stages and tables, and load data from S3 (sketched below).
  • Tools & Techniques: Snowpark DataFrame API for SQL-like transformations; regex for date formatting.
  • Outcome: Millions of records loaded in seconds; structured feature tables created.
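
A hedged sketch of the loading step; the S3 URL, table definition, and file format are illustrative placeholders (a private bucket would also need a storage integration or credentials).

```python
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

# External stage over the raw CSV files in S3.
session.sql("""
    CREATE OR REPLACE STAGE RAW_TRIPS_STAGE
    URL = 's3://<bucket>/citibike-trips/'
""").collect()

# Illustrative target table.
session.sql("""
    CREATE OR REPLACE TABLE TRIPS (
        START_TIME TIMESTAMP, STOP_TIME TIMESTAMP,
        START_STATION_ID INT, END_STATION_ID INT
    )
""").collect()

# Bulk-load the staged files; Snowflake parallelizes the copy.
session.sql("""
    COPY INTO TRIPS FROM @RAW_TRIPS_STAGE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1
                   FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""").collect()
```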

Data Marketplace

  • Goal: Enrich feature set with external data.
  • Steps: Subscribe to Weather Source data through the Snowflake Marketplace.
  • Outcome: Weather features become queryable directly within Snowflake, with no ETL (see the query sketch below).
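
Once the subscription is in place, the shared data is just another database to query. A hedged sketch with placeholder database, table, and column names (check the actual share name in your account after subscribing):

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH",
}).create()

# The share appears as a read-only database; names below are illustrative.
weather = session.table("WEATHER_SOURCE_SHARE.PUBLIC.HISTORY_DAY")
nyc_weather = (
    weather.filter(col("POSTAL_CODE") == "10001")
           .select("DATE_VALID_STD", "AVG_TEMPERATURE_AIR_2M_F",
                   "TOT_PRECIPITATION_IN")
)
nyc_weather.show()
```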

Data Science: Feature Engineering and Model Training

  • Goal: Develop a predictive model for Citi Bike demand.
  • Steps: Aggregate daily trip counts using the Snowpark API; create lag features and holiday indicators (see the sketch below).
  • Model Training: Use PyTorch TabNet; perform a train/test split, fit the model, and evaluate with an RMSE metric.
  • Outcome: Initial model developed; feature importance evaluated and model explainability considered.
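
A condensed sketch of both steps, assuming a hypothetical DAILY_TRIPS table with STATION_ID, DATE, and COUNT columns: lag features are computed server-side with Snowpark window functions, then the feature table is pulled into pandas for a local TabNet fit.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, lag
from snowflake.snowpark.window import Window
from pytorch_tabnet.tab_model import TabNetRegressor

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

# Lag features: yesterday's and last week's trip counts, per station.
w = Window.partition_by("STATION_ID").order_by("DATE")
daily = (
    session.table("DAILY_TRIPS")                       # hypothetical table
           .with_column("LAG_1", lag(col("COUNT"), 1).over(w))
           .with_column("LAG_7", lag(col("COUNT"), 7).over(w))
           .na.drop()                                  # drop warm-up rows
)

# Pull the feature table into pandas for local training.
pdf = daily.to_pandas()
X = pdf[["LAG_1", "LAG_7"]].to_numpy().astype("float32")
y = pdf["COUNT"].to_numpy().astype("float32").reshape(-1, 1)

split = int(len(pdf) * 0.8)          # simple chronological split
model = TabNetRegressor()
model.fit(X[:split], y[:split],
          eval_set=[(X[split:], y[split:])],
          eval_metric=["rmse"])
print(model.feature_importances_)    # TabNet's built-in explainability
```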

ML Engineering: Parallelization and UDFs

  • Goal: Scale model training and inference across all stations.
  • Steps: Use Snowpark window functions and Python UDFs to fan the training work out in parallel (one common pattern is sketched below).
  • Outcome: 516 models trained efficiently in parallel, with explanations and evaluation metrics retrieved for each.
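
One common pattern for this, sketched below, is a partitioned Python UDTF (a table-function variant of the UDFs mentioned above): Snowflake invokes one handler instance per station partition, so training fans out in parallel across the warehouse. Table and column names are illustrative, and scikit-learn stands in for TabNet here since server-side packages must be available in the Snowflake Anaconda channel.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import FloatType, StructField, StructType

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

class TrainPerStation:
    def __init__(self):
        self.X, self.y = [], []

    def process(self, lag_1, lag_7, count):
        self.X.append([lag_1, lag_7])       # buffer this partition's rows
        self.y.append(count)
        return None                         # emit nothing per-row

    def end_partition(self):
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error
        model = LinearRegression().fit(self.X, self.y)
        rmse = mean_squared_error(self.y, model.predict(self.X)) ** 0.5
        yield (float(rmse),)                # one metrics row per station

session.udtf.register(
    TrainPerStation,
    name="TRAIN_STATION_MODEL",
    input_types=[FloatType(), FloatType(), FloatType()],
    output_schema=StructType([StructField("RMSE", FloatType())]),
    packages=["scikit-learn"],
    replace=True,
)

# PARTITION BY fans the training out across stations in parallel.
metrics = session.sql("""
    SELECT STATION_ID, RMSE
    FROM DAILY_FEATURES,
         TABLE(TRAIN_STATION_MODEL(LAG_1::FLOAT, LAG_7::FLOAT, COUNT::FLOAT)
               OVER (PARTITION BY STATION_ID))
""")
metrics.show()
```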

MLOps: Orchestration and Pipeline Creation

  • Goal: Operationalize ML model and deploy comprehensive pipelines.
  • Steps: Create tasks for the ETL, feature engineering, training, and evaluation steps; deploy the DAG in Airflow (a sketch follows).
  • Outcome: An automated, orchestrated end-to-end ML pipeline, ready for scheduled execution.
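
A hedged sketch of the Airflow side using the TaskFlow API (Airflow 2.4+); the step bodies are hypothetical stand-ins for the lab's actual tasks, and credentials would normally come from an Airflow connection rather than literals.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def citibike_ml_pipeline():

    @task
    def ingest():
        from snowflake.snowpark import Session
        session = Session.builder.configs({
            "account": "<account_identifier>", "user": "<user>",
            "password": "<password>", "warehouse": "COMPUTE_WH",
        }).create()
        session.sql("COPY INTO TRIPS FROM @RAW_TRIPS_STAGE").collect()

    @task
    def engineer_features():
        ...  # build the feature table with the window-function logic above

    @task
    def train_and_evaluate():
        ...  # call the per-station training UDTF across all stations

    # Linear dependency chain: ingest -> features -> train/evaluate.
    ingest() >> engineer_features() >> train_and_evaluate()

citibike_ml_pipeline()
```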

Real-World Use Case

  • Citi Bike Data: Predict bike demand across NYC stations to support maintenance and availability planning.
  • Reproducibility: Modular design allows focus on specific tasks; code is reusable in production.

Visualization with Streamlit

  • Goal: Share insights and predictions with end-users.
  • Setup: Connect to Snowflake, retrieve prediction and feature-importance data, and display them in a web app (see the sketch below).
  • Outcome: An interactive app showing forecasts and model explanations.
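
A minimal sketch of such an app, assuming a hypothetical PREDICTIONS table and Snowflake credentials stored under a "snowflake" section in st.secrets:

```python
import streamlit as st
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

@st.cache_resource
def get_session() -> Session:
    # Credentials come from .streamlit/secrets.toml, not hard-coded literals.
    return Session.builder.configs(dict(st.secrets["snowflake"])).create()

session = get_session()
st.title("Citi Bike Demand Forecasts")

# Let the user pick a station; table and column names are illustrative.
stations = [row[0] for row in
            session.table("PREDICTIONS").select("STATION_ID").distinct().collect()]
station = st.selectbox("Station", stations)

preds = (
    session.table("PREDICTIONS")
           .filter(col("STATION_ID") == station)
           .sort("DATE")
           .to_pandas()
)
st.line_chart(preds, x="DATE", y="PREDICTED_TRIPS")
```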

Key Takeaways

  • Snowpark Benefits: Scalable, governed, and secure ML workflows without data movement.
  • Integration: Smooth integration with existing data ecosystems and tools (e.g., Anaconda, Airflow, Streamlit).
  • Performance: Efficient execution of heavy ML tasks within Snowflake’s environment.
  • Governance: Consistent security and governance across the data lifecycle.

Useful Resources

  • Snowflake Quick Starts: Various ML and Snowpark labs.
  • Repository: Complete code available on GitHub for walkthrough and exercises.
  • Further Reading: Visit Snowflake's and Astronomer's websites.

Conclusion

  • Wrap Up: Comprehensive overview of building and deploying ML models using Snowpark and Snowflake.
  • Invitation: Explore more resources and continue learning about integrating data pipelines and ML workflows.