Overview of AWS Glue ETL Service

Aug 22, 2024

AWS Glue Course Notes

Course Overview

  • Focus: In-depth look at AWS Glue service
  • Topics Covered:
    • What is AWS Glue and its uses
    • Key components of AWS Glue
    • AWS Glue console walkthrough
    • Hands-on lab preparation
    • Main components of AWS Glue: Data Catalog and ETL

What is AWS Glue?

  • Definition: A fully managed ETL (Extract, Transform, Load) service by AWS.
  • ETL Definition:
    • Extract data from a source
    • Transform data as needed
    • Load it to a target location
  • Fully Managed: AWS handles all backend infrastructure, provisioning, and software installations.
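
The three ETL stages can be illustrated without any AWS service at all. The sketch below is a minimal, local stand-in: the sample CSV, the field names, and the JSON Lines output format are all assumptions chosen for illustration, not part of the course material.

```python
import csv
import io
import json

# Hypothetical in-memory source data standing in for a file in S3.
RAW_CSV = "order_id,amount,internal_notes\n1,19.99,check stock\n2,5.00,\n"


def extract(csv_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop the internal_notes field and convert types."""
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        for row in rows
    ]


def load(rows):
    """Load: serialize the rows as JSON Lines for the target location."""
    return "\n".join(json.dumps(row) for row in rows)


output = load(transform(extract(RAW_CSV)))
print(output)
```

In Glue, the same three stages run on a managed Spark cluster instead of in a single local process.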

Key Features of AWS Glue

  1. Data Catalog:
    • Persistent technical metadata store.
    • Can connect to multiple data sources (S3, RDS, DynamoDB, etc.)
    • Centralizes metadata so that services such as Athena and Glue ETL jobs share one schema definition.
  2. Spark ETL Engine:
    • Allows creation, running, and monitoring of ETL pipelines.

AWS Glue Components

1. Data Catalog

  • Components:
    • Databases: Logical containers for grouping tables.
    • Tables: Metadata representations of data stored in various sources.
    • Crawlers: Programs that scan data sources to infer schema and create metadata tables.
    • Connections: Configuration objects for connecting to data stores.

2. ETL

  • Components:
    • ETL Jobs: Transform data using Spark.
    • Triggers: Start jobs (and crawlers) on a schedule, on demand, or when an event occurs, such as a preceding job completing.
    • Workflows: Orchestrate multiple jobs, crawlers, and triggers as a single unit.

AWS Glue Console Walkthrough

  • Data Catalog & ETL Sections:
    • Data Catalog includes databases, tables, crawlers, and connections.
    • ETL includes jobs, triggers, and workflows.

Hands-On Lab Preparation

  1. Create an S3 bucket for demo purposes.
  2. Create an IAM role that AWS Glue can assume, with the necessary permissions (S3 access, CloudWatch Logs).
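
The two preparation steps could also be scripted with boto3. This is a sketch under assumptions: the bucket name, role name, and inline policy name are placeholders supplied by the caller or invented here, and the AWS-managed AWSGlueServiceRole policy is used as one reasonable way to grant the Glue and CloudWatch Logs permissions. The demo bucket gets its own inline policy because that managed policy does not cover arbitrary bucket names.

```python
import json

GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy():
    """Trust policy that lets the Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def bucket_access_policy(bucket_name):
    """Inline policy granting read/write access to the demo bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


def create_lab_resources(bucket_name, role_name, region):
    """Create the demo bucket and the Glue role (requires AWS credentials)."""
    import boto3  # imported lazily so the policy helpers work without the SDK

    s3 = boto3.client("s3", region_name=region)
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        # Outside us-east-1 the region must be given as a location constraint.
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_MANAGED_POLICY_ARN)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="glue-demo-bucket-access",  # hypothetical policy name
        PolicyDocument=json.dumps(bucket_access_policy(bucket_name)),
    )
```

A call such as create_lab_resources("my-glue-demo-bucket", "GlueDemoRole", "us-east-1") would create both resources; both names are placeholders.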

Creating a Database and Table in AWS Glue

  • Database: Logical grouping of tables (not physical). Example: sampleDB.
  • Table Creation:
    • Tables contain metadata about data in sources but do not store data themselves.
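
As a sketch of what a manually defined database and table look like through the API, the helpers below build the request bodies for boto3's Glue client. The table name, columns, and S3 location are hypothetical, and the CSV format settings are one common choice rather than the only one. The database name is lowercased because the catalog stores names in lowercase for Hive compatibility.

```python
def database_input(name):
    """Request body for glue.create_database."""
    return {"Name": name.lower(), "Description": "Demo database for the Glue lab"}


def csv_table_input(table_name, s3_location, columns):
    """Request body for glue.create_table describing CSV files in S3.

    columns is a list of (name, type) pairs, e.g. [("order_id", "int")].
    Only metadata is stored; the data itself stays at s3_location.
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def create_database_and_table(db_name, table_input):
    """Send both requests to Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builders work without the SDK

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput=database_input(db_name))
    glue.create_table(DatabaseName=db_name.lower(), TableInput=table_input)


# Hypothetical table definition for illustration.
orders = csv_table_input(
    "orders",
    "s3://my-glue-demo-bucket/input/orders/",
    [("order_id", "int"), ("amount", "double")],
)
```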

Creating a Table using Crawlers

  • Crawlers automatically determine schema and create tables in the Data Catalog.
  • Crawler workflow:
    1. Connects to data source
    2. Scans and infers schema
    3. Creates metadata table in Glue Catalog.
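
The crawler workflow above can be sketched with boto3 as follows. The crawler name, role ARN, database, and S3 path are hypothetical placeholders; the crawler runs asynchronously, so the tables appear in the catalog only after the run finishes.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Request body for glue.create_crawler with a single S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(request):
    """Create the crawler and start one run (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration.
request = crawler_request(
    "orders-crawler",
    "arn:aws:iam::123456789012:role/GlueDemoRole",
    "sampledb",
    "s3://my-glue-demo-bucket/input/orders/",
)
```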

Querying Data with Athena

  • After creating a table, you can use Amazon Athena to run SQL queries on data stored in S3 without moving it.
  • Partitioning: Improves query performance and reduces the data scanned, because a query that filters on a partition column (e.g., load date) reads only the matching partitions.
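
A query that benefits from partitioning filters on the partition column. The sketch below assumes a hypothetical orders table partitioned by a load_date column and a placeholder S3 location for query results. The query string is built from trusted constants; user-supplied values should not be interpolated this way.

```python
def partition_query(table, partition_column, partition_value, limit=10):
    """Build a query that restricts the scan to a single partition."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {partition_column} = '{partition_value}' "
        f"LIMIT {limit}"
    )


def run_query(sql, database, output_location):
    """Submit the query to Athena and return its execution ID.

    Requires AWS credentials. Athena writes the results to output_location.
    """
    import boto3  # imported lazily so the builder works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


sql = partition_query("orders", "load_date", "2024-08-22")
print(sql)
```

The query runs asynchronously: the returned execution ID is what you would use to poll for completion and fetch results.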

AWS Glue Connections

  • Configuration objects to connect Glue to data stores (credentials and endpoint info).
  • Allows Glue to access data for scanning and processing.
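
As an illustration, a JDBC connection to a relational database might be defined like this with boto3. The connection name, JDBC URL, and secret name are hypothetical; storing credentials in AWS Secrets Manager and referencing them by ID is an assumption made here to avoid putting a password in the request.

```python
def jdbc_connection_input(name, jdbc_url, secret_id):
    """Request body for glue.create_connection (JDBC data store)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,  # endpoint info
            "SECRET_ID": secret_id,  # credentials held in Secrets Manager
        },
    }


def create_connection(connection_input):
    """Register the connection in the Data Catalog (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    glue = boto3.client("glue")
    glue.create_connection(ConnectionInput=connection_input)


# Hypothetical values for illustration.
connection = jdbc_connection_input(
    "demo-mysql",
    "jdbc:mysql://demo-db.example.com:3306/sales",
    "demo/mysql/credentials",
)
```

A database inside a VPC would typically also need PhysicalConnectionRequirements (subnet, security groups, availability zone) in the request so Glue can reach it.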

AWS Glue ETL Jobs

  • Core functionality for transforming data.
  • Jobs can be created through visual interfaces or custom scripts.
  • Script Generation: Glue automatically generates the script for a job built in the visual editor.

Creating an ETL Job Example

  1. Specify source data location (e.g., S3).
  2. Define transformations (e.g., drop fields).
  3. Specify target location for transformed data.
  4. Save and run the ETL job.
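
A hand-written PySpark script covering steps 1 to 3 might look like the sketch below. It is not the script Glue would generate, and the database, table, field, and target path are hypothetical. The awsglue modules exist only inside the Glue job runtime, so they are imported inside main, and the script runs the job only when Glue supplies the --JOB_NAME argument.

```python
import sys

# Hypothetical names for illustration; replace with your own catalog entries.
SOURCE_DATABASE = "sampledb"
SOURCE_TABLE = "orders"
FIELDS_TO_DROP = ["internal_notes"]
TARGET_PATH = "s3://my-glue-demo-bucket/output/orders/"


def main(argv):
    # These modules are available only in the Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import DropFields
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # 1. Source: read the table registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=SOURCE_DATABASE, table_name=SOURCE_TABLE
    )

    # 2. Transform: drop the unwanted fields.
    trimmed = DropFields.apply(frame=source, paths=FIELDS_TO_DROP)

    # 3. Target: write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=trimmed,
        connection_type="s3",
        connection_options={"path": TARGET_PATH},
        format="parquet",
    )

    job.commit()


# Glue passes --JOB_NAME when it runs the script; skip the run otherwise
# (for example, when the file is imported for a local check).
if any(arg.startswith("--JOB_NAME") for arg in sys.argv):
    main(sys.argv)
```

Reading from the catalog table rather than a raw S3 path means the job reuses the schema that the crawler or manual table definition already recorded.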

Triggers in AWS Glue

  • Automate job execution.
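  • Types: Scheduled (e.g., running every hour), conditional (fired when a preceding job or crawler finishes), on-demand, or event-based (fired by an Amazon EventBridge event).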
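
The hourly schedule mentioned above can be sketched as a scheduled trigger created through boto3. The trigger and job names are hypothetical. The cron expression uses the six-field AWS format (minutes, hours, day of month, month, day of week, year) and fires at minute 0 of every hour.

```python
HOURLY_CRON = "cron(0 * * * ? *)"  # minute 0 of every hour, every day


def hourly_trigger_request(trigger_name, job_name):
    """Request body for glue.create_trigger that runs one job every hour."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": HOURLY_CRON,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,  # activate the trigger immediately
    }


def create_trigger(request):
    """Register the trigger with Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    glue = boto3.client("glue")
    glue.create_trigger(**request)


# Hypothetical names for illustration.
trigger = hourly_trigger_request("orders-hourly", "orders-etl-job")
```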

Summary

  • Covered the fundamentals of AWS Glue, including Data Catalog, ETL jobs, and hands-on examples.
  • Future sessions may explore additional features such as schema registry and workflows.