
Lecture Notes: End-to-End Machine Learning using Snowpark Python

Jun 13, 2024

Presenter

  • Name: Caleb Bechtold
  • Position: Machine Learning and AI Architect, Field CTO Office, Snowflake

Introduction

  • Objective: Build a scalable and secure ML workflow on Snowflake Data Cloud using Snowpark Python.
  • The entire workflow runs within Snowflake, with no data movement.
  • Pipelines are orchestrated as a DAG in Apache Airflow.
  • Results are visualized for end users with Streamlit and Snowpark.

Key Components

  • Snowpark: A framework that extends Snowflake to non-SQL workloads (Python, Java, Scala).
    • Client API: DataFrame-style transformations, with compute pushed down to Snowflake (see the session sketch after this list).
    • Server-side Runtime: Executes arbitrary Python code, UDFs, and UDTFs on Snowflake's compute infrastructure.
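
A minimal sketch of the client API, assuming placeholder connection parameters and a hypothetical TRIPS table: a Session object is created, and DataFrame operations are lazily compiled to SQL and pushed down to a Snowflake warehouse.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Connection parameters are placeholders -- fill in your own account details.
session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

# DataFrame transformations run on the warehouse; no data leaves Snowflake.
trips = session.table("TRIPS")                      # hypothetical table
daily_counts = (
    trips.filter(col("START_STATION_ID").is_not_null())
         .group_by(col("START_STATION_ID"))
         .count()
)
daily_counts.show()
```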

Workflow Overview

  1. Data Engineering: ETL tasks using Snowpark DataFrame API; loading data into Snowflake from S3.
  2. Feature Engineering: Create aggregate features from loaded data (e.g., lag features using window functions).
  3. Model Training: Use PyTorch TabNet for training ML models on historical data.
  4. Orchestration and Deployment: Automate with Apache Airflow, utilizing Astronomer's managed Airflow instance.
  5. Visualization: Expose model results through Streamlit applications.

Detailed Steps

Data Engineering

  • Goal: Ingest raw data, create structured tables for ML models.
  • Steps: Set up a Snowpark session, create stages and tables, and load data from S3 (sketched below).
  • Tools & Techniques: Snowpark DataFrame API for SQL-like transformations; regex for date formatting.
  • Outcome: Millions of records loaded in seconds; structured feature tables created.
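
A hedged sketch of the loading step; the S3 URL, table definition, and file format are illustrative placeholders (a private bucket would also need a storage integration or credentials).

```python
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

# External stage over the raw CSV files in S3.
session.sql("""
    CREATE OR REPLACE STAGE RAW_TRIPS_STAGE
    URL = 's3://<bucket>/citibike-trips/'
""").collect()

# Illustrative target table.
session.sql("""
    CREATE OR REPLACE TABLE TRIPS (
        START_TIME TIMESTAMP, STOP_TIME TIMESTAMP,
        START_STATION_ID INT, END_STATION_ID INT
    )
""").collect()

# Bulk-load the staged files; Snowflake parallelizes the copy.
session.sql("""
    COPY INTO TRIPS FROM @RAW_TRIPS_STAGE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1
                   FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""").collect()
```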

Data Marketplace

  • Goal: Enrich feature set with external data.
  • Steps: Subscribe to Weather Source data through the Snowflake Marketplace.
  • Outcome: Weather features become queryable directly within Snowflake, with no ETL (see the query sketch below).
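
Once the subscription is in place, the shared data is just another database to query. A hedged sketch with placeholder database, table, and column names (check the actual share name in your account after subscribing):

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH",
}).create()

# The share appears as a read-only database; names below are illustrative.
weather = session.table("WEATHER_SOURCE_SHARE.PUBLIC.HISTORY_DAY")
nyc_weather = (
    weather.filter(col("POSTAL_CODE") == "10001")
           .select("DATE_VALID_STD", "AVG_TEMPERATURE_AIR_2M_F",
                   "TOT_PRECIPITATION_IN")
)
nyc_weather.show()
```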

Data Science: Feature Engineering and Model Training

  • Goal: Develop a predictive model for Citi Bike demand.
  • Steps: Aggregate daily trip counts using the Snowpark API; create lag features and holiday indicators (see the sketch below).
  • Model Training: Use PyTorch TabNet; perform a train/test split, fit the model, and evaluate with an RMSE metric.
  • Outcome: Initial model developed; feature importance evaluated and model explainability considered.
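
A condensed sketch of both steps, assuming a hypothetical DAILY_TRIPS table with STATION_ID, DATE, and COUNT columns: lag features are computed server-side with Snowpark window functions, then the feature table is pulled into pandas for a local TabNet fit.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, lag
from snowflake.snowpark.window import Window
from pytorch_tabnet.tab_model import TabNetRegressor

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

# Lag features: yesterday's and last week's trip counts, per station.
w = Window.partition_by("STATION_ID").order_by("DATE")
daily = (
    session.table("DAILY_TRIPS")                       # hypothetical table
           .with_column("LAG_1", lag(col("COUNT"), 1).over(w))
           .with_column("LAG_7", lag(col("COUNT"), 7).over(w))
           .na.drop()                                  # drop warm-up rows
)

# Pull the feature table into pandas for local training.
pdf = daily.to_pandas()
X = pdf[["LAG_1", "LAG_7"]].to_numpy().astype("float32")
y = pdf["COUNT"].to_numpy().astype("float32").reshape(-1, 1)

split = int(len(pdf) * 0.8)          # simple chronological split
model = TabNetRegressor()
model.fit(X[:split], y[:split],
          eval_set=[(X[split:], y[split:])],
          eval_metric=["rmse"])
print(model.feature_importances_)    # TabNet's built-in explainability
```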

ML Engineering: Parallelization and UDFs

  • Goal: Scale model training and inference across all stations.
  • Steps: Use Snowpark window functions and Python UDFs to fan the training work out in parallel (one common pattern is sketched below).
  • Outcome: 516 models trained efficiently in parallel, with explanations and evaluation metrics retrieved for each.
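
One common pattern for this, sketched below, is a partitioned Python UDTF (a table-function variant of the UDFs mentioned above): Snowflake invokes one handler instance per station partition, so training fans out in parallel across the warehouse. Table and column names are illustrative, and scikit-learn stands in for TabNet here since server-side packages must be available in the Snowflake Anaconda channel.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import FloatType, StructField, StructType

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
}).create()

class TrainPerStation:
    def __init__(self):
        self.X, self.y = [], []

    def process(self, lag_1, lag_7, count):
        self.X.append([lag_1, lag_7])       # buffer this partition's rows
        self.y.append(count)
        return None                         # emit nothing per-row

    def end_partition(self):
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error
        model = LinearRegression().fit(self.X, self.y)
        rmse = mean_squared_error(self.y, model.predict(self.X)) ** 0.5
        yield (float(rmse),)                # one metrics row per station

session.udtf.register(
    TrainPerStation,
    name="TRAIN_STATION_MODEL",
    input_types=[FloatType(), FloatType(), FloatType()],
    output_schema=StructType([StructField("RMSE", FloatType())]),
    packages=["scikit-learn"],
    replace=True,
)

# PARTITION BY fans the training out across stations in parallel.
metrics = session.sql("""
    SELECT STATION_ID, RMSE
    FROM DAILY_FEATURES,
         TABLE(TRAIN_STATION_MODEL(LAG_1::FLOAT, LAG_7::FLOAT, COUNT::FLOAT)
               OVER (PARTITION BY STATION_ID))
""")
metrics.show()
```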

MLOps: Orchestration and Pipeline Creation

  • Goal: Operationalize ML model and deploy comprehensive pipelines.
  • Steps: Create tasks for the ETL, feature engineering, training, and evaluation steps; deploy the DAG in Airflow (a sketch follows).
  • Outcome: An automated, orchestrated end-to-end ML pipeline, ready for scheduled execution.
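
A hedged sketch of the Airflow side using the TaskFlow API (Airflow 2.4+); the step bodies are hypothetical stand-ins for the lab's actual tasks, and credentials would normally come from an Airflow connection rather than literals.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def citibike_ml_pipeline():

    @task
    def ingest():
        from snowflake.snowpark import Session
        session = Session.builder.configs({
            "account": "<account_identifier>", "user": "<user>",
            "password": "<password>", "warehouse": "COMPUTE_WH",
        }).create()
        session.sql("COPY INTO TRIPS FROM @RAW_TRIPS_STAGE").collect()

    @task
    def engineer_features():
        ...  # build the feature table with the window-function logic above

    @task
    def train_and_evaluate():
        ...  # call the per-station training UDTF across all stations

    # Linear dependency chain: ingest -> features -> train/evaluate.
    ingest() >> engineer_features() >> train_and_evaluate()

citibike_ml_pipeline()
```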

Real-World Use Case

  • Citi Bike Data: Predict bike demand across NYC stations to support maintenance and availability planning.
  • Reproducibility: Modular design allows focus on specific tasks; code is reusable in production.

Visualization with Streamlit

  • Goal: Share insights and predictions with end-users.
  • Setup: Connect to Snowflake, retrieve prediction and feature-importance data, and display them in a web app (see the sketch below).
  • Outcome: An interactive app showing forecasts and model explanations.
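
A minimal sketch of such an app, assuming a hypothetical PREDICTIONS table and Snowflake credentials stored under a "snowflake" section in st.secrets:

```python
import streamlit as st
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

@st.cache_resource
def get_session() -> Session:
    # Credentials come from .streamlit/secrets.toml, not hard-coded literals.
    return Session.builder.configs(dict(st.secrets["snowflake"])).create()

session = get_session()
st.title("Citi Bike Demand Forecasts")

# Let the user pick a station; table and column names are illustrative.
stations = [row[0] for row in
            session.table("PREDICTIONS").select("STATION_ID").distinct().collect()]
station = st.selectbox("Station", stations)

preds = (
    session.table("PREDICTIONS")
           .filter(col("STATION_ID") == station)
           .sort("DATE")
           .to_pandas()
)
st.line_chart(preds, x="DATE", y="PREDICTED_TRIPS")
```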

Key Takeaways

  • Snowpark Benefits: Scalable, governed, and secure ML workflows without data movement.
  • Integration: Smooth integration with existing data ecosystems and tools (e.g., Anaconda, Airflow, Streamlit).
  • Performance: Efficient execution of heavy ML tasks within Snowflake’s environment.
  • Governance: Consistent security and governance across the data lifecycle.

Useful Resources

  • Snowflake Quick Starts: Various ML and Snowpark labs.
  • Repository: Complete code available on GitHub for walkthrough and exercises.
  • Further Reading: Visit Snowflake's and Astronomer's websites.

Conclusion

  • Wrap Up: Comprehensive overview of building and deploying ML models using Snowpark and Snowflake.
  • Invitation: Explore more resources and continue learning about integrating data pipelines and ML workflows.