Intro to AWS Data Engineering by Johnny Chivers

Jun 19, 2024

YouTube Lecture: Intro to AWS Data Engineering by Johnny Chivers

Overview

  • Presenter: Johnny Chivers (Data Engineer with 10+ years of experience)
  • Sector: Cyber Security
  • Platform: AWS
  • Focus: AWS Data Engineering
  • Course Basis: AWS Data Engineering Immersion Day
  • Resources: Available on Github
  • Key Topics: AWS Kinesis, AWS Database Migration Service (DMS), AWS Glue
  • Requirements: AWS account, minimal cost (approx. $5 for 5 days)

Course Content Outline

AWS Kinesis

Key Points

  • Real-Time Data Streaming: Fully managed, scalable streaming solution by AWS.
  • Comparison: AWS Kinesis vs Kafka (AWS Managed vs Open Source)
  • Scalability: Throughput scales with the number of shards; each shard accepts up to 1,000 records per second (or 1 MB per second) of writes
  • Components:
    • Kinesis Data Streams: Basic building block, manage your own producers and consumers
    • Kinesis Data Firehose: More managed, fewer configurations, delivers only to a fixed set of destinations (for example S3, Redshift, OpenSearch)
    • Kinesis Data Analytics: Real-time processing using SQL, build real-time apps, anomaly detection
    • Kinesis Video Streams: Streams video data, use cases include ML for video analysis
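
The scalability point above comes down to shard count in Kinesis Data Streams. As a rough sketch, the helper below estimates how many shards a workload needs from the published per-shard limits (1,000 records per second and 1 MB per second for writes, 2 MB per second for reads). The name `shards_needed`, the treatment of 1 MB as 1024 * 1024 bytes, and the assumption that all consumers share the standard read limit are illustrative choices, not taken from the lecture.

```python
import math

# Per-shard limits for Kinesis Data Streams (provisioned capacity).
WRITE_RECORDS_PER_SEC = 1000
WRITE_BYTES_PER_SEC = 1024 * 1024      # assumption: 1 MB treated as 1 MiB
READ_BYTES_PER_SEC = 2 * 1024 * 1024   # shared by standard (non-fan-out) consumers


def shards_needed(records_per_sec: int, avg_record_bytes: int, consumers: int = 1) -> int:
    """Estimate the shard count for a workload (hypothetical helper)."""
    write_bytes = records_per_sec * avg_record_bytes
    by_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_write_bytes = math.ceil(write_bytes / WRITE_BYTES_PER_SEC)
    # Every consumer reads the full stream, so read volume grows with consumers.
    by_read_bytes = math.ceil(write_bytes * consumers / READ_BYTES_PER_SEC)
    # The tightest of the three limits decides the shard count.
    return max(1, by_records, by_write_bytes, by_read_bytes)


if __name__ == "__main__":
    print(shards_needed(records_per_sec=5000, avg_record_bytes=1024))
```

Taking the maximum of the three estimates reflects that whichever limit is hit first forces another shard.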

Architecture

  • Producers: Various ways to put data on the stream (Kinesis Producer Library, AWS SDK, AWS CLI)
  • Consumers: Applications that read from the stream (for example EC2 instances, Lambda functions)
  • Shards: The unit of throughput; add or remove shards to manage stream capacity
  • Partition Key: Determines which shard a record is written to, distributing data across shards
  • Retention Period: Records are retained for 24 hours by default, configurable up to 365 days
  • Common Terms:
    • Producer, Shard, Partition Key, Sequence Number, Consumer, Retention Period
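
To make the partition key and shard terms concrete: Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and writes the record to the shard whose hash key range contains that value. The sketch below mimics that routing locally. The function names and the evenly split ranges are assumptions for illustration; a real stream reports its own ranges per shard.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1


def hash_key(partition_key: str) -> int:
    """MD5-hash a partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count: int) -> list[tuple[int, int]]:
    """Split the 128-bit hash key space into contiguous, near-equal ranges."""
    size = 2**128 // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the whole space is covered.
        end = MAX_HASH_KEY if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: list[tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash key falls outside all shard ranges")


if __name__ == "__main__":
    ranges = even_shard_ranges(4)
    for key in ("sensor-1", "sensor-2", "sensor-3"):
        print(key, "-> shard", shard_for(key, ranges))
```

Because the mapping is deterministic, all records sharing a partition key land on the same shard, which is what preserves their relative order.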

Lab Example

  • Set Up Kinesis Components: Firehose, Data Generators, Analytics, etc.
  • Test Anomalies: Real-time detection of anomalies via email notifications
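
These notes do not record the lab's own detection logic, so the sketch below swaps in a simple rolling z-score as a stand-in to show the general idea of flagging outliers as records arrive. The class name, window size, warm-up length, and threshold are all assumptions, and the notification step is left out.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag values far from the recent average (stand-in technique, not the lab's)."""

    def __init__(self, window: int = 20, threshold: float = 3.0, warmup: int = 5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the current window."""
        is_anomaly = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        # Keep anomalies out of the window so they do not distort the baseline.
        if not is_anomaly:
            self.values.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingZScoreDetector()
    for reading in [10, 11, 9, 10, 12, 10, 11, 95, 10]:
        if detector.observe(reading):
            print("anomaly:", reading)
```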

AWS Database Migration Service (DMS)

Key Points

  • Definition: Helps migrate databases to AWS with minimal downtime
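  • Supports: Multiple database sources and targets; DMS itself creates only basic target schema objects, with fuller conversion left to the Schema Conversion Tool
  • Architecture: Source, Target, Replication Instance, Tasks
  • Replication Instance: The managed compute instance that runs the replication tasks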
  • Endpoints: Define Source and Target
  • Common Terms:
    • Replication Instance, Endpoints, Replication Tasks, Schema Conversion Tool

Lab Example

  • Setup: Create an RDS instance, migrate to DynamoDB using DMS
  • Steps:
    • Create an RDS PostgreSQL instance and load sample data
    • Set up DMS Replication Instance, configure Source/Target Endpoints
    • Use DMS Task to replicate and transform data
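
The last step above can be sketched as the parameters a replication task needs: the three ARNs created in the earlier steps, a migration type, and a table-mapping document that selects what to copy. The helper names, the `public` schema, and the placeholder ARNs are assumptions for illustration. Any object-mapping rules the lab may use for the DynamoDB target are omitted.

```python
import json


def selection_rule(schema: str, table: str = "%", rule_id: int = 1) -> dict:
    """Build a DMS selection rule that includes matching tables ("%" is a wildcard)."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"include-{schema}",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }


def replication_task_params(task_id: str, source_arn: str, target_arn: str,
                            instance_arn: str, schema: str) -> dict:
    """Assemble keyword arguments for a one-off full-load replication task."""
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load",  # alternatives: "cdc", "full-load-and-cdc"
        # DMS expects the table mappings as a JSON string, not a dict.
        "TableMappings": json.dumps({"rules": [selection_rule(schema)]}),
    }


if __name__ == "__main__":
    params = replication_task_params(
        task_id="rds-to-dynamodb",        # hypothetical task name
        source_arn="SOURCE_ENDPOINT_ARN",  # placeholder
        target_arn="TARGET_ENDPOINT_ARN",  # placeholder
        instance_arn="REPLICATION_INSTANCE_ARN",  # placeholder
        schema="public",
    )
    print(json.dumps(params, indent=2))
```

With boto3 installed and real ARNs in place, a dictionary like this could be passed as `boto3.client("dms").create_replication_task(**params)`.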

AWS Glue

Key Points

  • Managed ETL Service: Extract, Transform, Load data with minimal management
  • Runs: Spark jobs (written in PySpark or Scala) or Python shell jobs
  • Components:
    • Glue Data Catalog: Metadata repository holding table and job definitions
    • Crawlers: Automate data discovery and schema inference
    • Jobs: Define ETL scripts
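
To illustrate what schema inference by a crawler means, here is a toy version in plain Python: it reads sample CSV data, picks the narrowest type that fits each column, and records the result in an in-memory dictionary standing in for the Data Catalog. This is not the Glue API. The function names, the three-type hierarchy, and the `sales_db` and `orders` names are assumptions.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Pick the narrowest type that fits every value in a column."""
    def all_parse(cast) -> bool:
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    return "string"


def crawl_csv(text: str) -> dict[str, str]:
    """Infer a column -> type schema from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


if __name__ == "__main__":
    sample = "id,price,city\n1,9.5,Belfast\n2,12,Derry\n"
    # A nested dict stands in for the catalog: database -> table -> schema.
    catalog = {"sales_db": {"orders": crawl_csv(sample)}}
    print(catalog)
```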

Architecture

  • Data Source
  • Data Transformation: ETL Script
  • Data Target: S3, databases, etc.
  • Glue Data Catalog: Organizes data into databases and tables
  • Crawlers: Populate the catalog with schemas inferred from data sources and targets
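
The source, transformation, and target stages above can be sketched as three small functions. This is a conceptual outline in plain Python rather than a Glue job script: in Glue the same shape would be expressed with Spark, and the in-memory CSV source, JSON Lines sink, and column names here are assumptions for illustration.

```python
import csv
import io
import json


def extract(csv_text: str) -> list[dict]:
    """Data source: read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: list[dict]) -> list[dict]:
    """ETL script: drop incomplete rows, cast types, normalise values."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "country": row["country"].strip().upper(),
        })
    return cleaned


def load(rows: list[dict], sink) -> int:
    """Data target: write one JSON object per line and return the row count."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")
    return len(rows)


if __name__ == "__main__":
    source = "order_id,amount,country\n1,10.5,uk\n2,,ie\n3,7,us \n"
    sink = io.StringIO()
    written = load(transform(extract(source)), sink)
    print(written, "rows written")
    print(sink.getvalue(), end="")
```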