Understanding Serverless SQL Pool in Synapse

Aug 26, 2024

Lecture Notes on Serverless SQL Pool in Synapse Analytics

Overview

  • Discussion on transitioning from a traditional data warehouse to a Lakehouse architecture.
  • Introduction to using Serverless SQL Pool in Synapse Analytics.

Components of Synapse Analytics

  • Pipelines: Used for orchestrating data processing (similar to Data Factory).
  • Data Flows: Visual data transformation without coding.
  • Spark Pools: Used for more control over processing, similar to Databricks.
  • SQL Pools: Two types: Dedicated and Serverless.

Dedicated SQL Pool vs. Serverless SQL Pool

Dedicated SQL Pool

  • Requires explicit provisioning.
  • Fixed costs based on uptime.
  • Can store traditional tables for data warehousing.

Serverless SQL Pool

  • Automatically created in Synapse workspace.
  • Pay only for data processed (e.g., $5 per terabyte).
  • No need for infrastructure management.
  • Supports distributed query processing.

Key Features of Serverless SQL Pool

  • Ad-Hoc Data Exploration: Easily query data from Data Lake, including CSV, Parquet, and Delta formats.
    • Use OPENROWSET function for querying external data without creating a dedicated SQL pool.
    • Can perform SQL queries on data retrieved from external files.
  • Logical Data Warehouse (Lakehouse):
    • Allows querying the Data Lake directly without needing a separate copy in SQL pool.
    • Simplifies access through views or external tables.
  • Quick and Simple Transformations:
    • Create views for data transformation but cannot create traditional tables.
    • Use CREATE EXTERNAL TABLE AS SELECT (CETAS) to store results back to Data Lake.

Use Cases for Serverless SQL Pool

  1. Ad-hoc Data Exploration:
    • Use for quick analysis of files (CSV, Parquet, Delta).
    • OPENROWSET can read data files directly, enabling SQL-like queries.
  2. Lakehouse Architecture:
    • Query data directly from the Data Lake using T-SQL.
    • No need to physically copy data to a SQL pool.
    • Can create views to encapsulate complexity.
  3. Quick Transformations:
    • Create views for simple transformations on data.
    • Use CETAS for persisting transformation results back to Data Lake.

Important Considerations

  • Data Types:
    • Serverless SQL Pool infers column types from CSV; provide explicit schema if necessary.
  • Delta Format:
    • Only supports Delta Lake version 1.0; newer versions not supported.
  • No Support for Subfolders:
    • OPENROWSET does not query subfolders by default.
  • Performance:
    • Serverless may be slower than dedicated due to the need to access Data Lake for each query.

Conclusion

  • Serverless SQL Pool offers a flexible, cost-effective solution for querying and transforming data in a Data Lake environment.
  • Key advantages include instant availability, low-cost data processing, and support for various data formats.
  • Suitable for organizations looking to simplify their data architecture by leveraging a Lakehouse model.