Lecture Notes on Serverless SQL Pool in Synapse Analytics

Overview

Discussion on transitioning from a traditional data warehouse to a Lakehouse architecture.
Introduction to using Serverless SQL Pool in Synapse Analytics.

Pipelines: Used for orchestrating data processing (similar to Data Factory).
Data Flows: Visual data transformation without coding.
Spark Pools: Used for more control over processing, similar to Databricks.
SQL Pools: Two types: Dedicated and Serverless.

Ad-Hoc Data Exploration: Easily query data from Data Lake, including CSV, Parquet, and Delta formats.
- Use OPENROWSET function for querying external data without creating a dedicated SQL pool.
- Can perform SQL queries on data retrieved from external files.
Logical Data Warehouse (Lakehouse):
- Allows querying the Data Lake directly without needing a separate copy in SQL pool.
- Simplifies access through views or external tables.
Quick and Simple Transformations:
- Create views for data transformation but cannot create traditional tables.
- Use CREATE EXTERNAL TABLE AS SELECT (CETAS) to store results back to Data Lake.

Ad-hoc Data Exploration:
- Use for quick analysis of files (CSV, Parquet, Delta).
- OPENROWSET can read data files directly, enabling SQL-like queries.
Lakehouse Architecture:
- Query data directly from the Data Lake using T-SQL.
- No need to physically copy data to a SQL pool.
- Can create views to encapsulate complexity.
Quick Transformations:
- Create views for simple transformations on data.
- Use CETAS for persisting transformation results back to Data Lake.

Data Types:
- Serverless SQL Pool infers column types from CSV; provide explicit schema if necessary.
Delta Format:
- Only supports Delta Lake version 1.0; newer versions not supported.
No Support for Subfolders:
- OPENROWSET does not query subfolders by default.
Performance:
- Serverless may be slower than dedicated due to the need to access Data Lake for each query.

Serverless SQL Pool offers a flexible, cost-effective solution for querying and transforming data in a Data Lake environment.
Key advantages include instant availability, low-cost data processing, and support for various data formats.
Suitable for organizations looking to simplify their data architecture by leveraging a Lakehouse model.