YouTube Lecture: Intro to AWS Data Engineering by Johnny Chivers (Jun 19, 2024)
Overview
Presenter: Johnny Chivers (data engineer with 10+ years of experience)
Sector: Cyber security
Platform: AWS
Focus: AWS data engineering
Course Basis: AWS Data Engineering Immersion Day
Resources: Available on GitHub
Key Topics: AWS Kinesis, AWS Data Migration Service (DMS), AWS Glue
Requirements: AWS account; minimal cost (approx. $5 over 5 days)
Course Content Outline
AWS Kinesis
Key Points
Real-Time Data Streaming: Fully managed, scalable streaming service from AWS
Comparison: AWS Kinesis vs. Kafka (AWS-managed vs. open source)
Scalability: Handles hundreds to thousands of records per second
Components:
Kinesis Data Streams: The basic building block; you manage your own producers and consumers
Kinesis Data Firehose: More fully managed with fewer configuration options, but a limited set of output destinations (see the sketch after this list)
Kinesis Data Analytics: Real-time processing using SQL; build real-time apps, e.g., anomaly detection
Kinesis Video Streams: Streams video data; use cases include ML-driven video analysis
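To make the Streams-vs-Firehose contrast concrete, here is a minimal boto3 sketch of putting a record on a Firehose delivery stream; the delivery stream name is a hypothetical placeholder. Note that, unlike Data Streams, no partition key or shard management is involved.

    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    # "demo-delivery-stream" is a hypothetical delivery stream name.
    # Firehose buffers records and delivers them to its configured
    # destination (e.g., S3) on its own; there is no partition key
    # or shard capacity to manage.
    firehose.put_record(
        DeliveryStreamName="demo-delivery-stream",
        Record={"Data": (json.dumps({"event": "page_view"}) + "\n").encode("utf-8")},
    )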
Architecture
Producers: Various ways to put data on the stream (Kinesis Producer Library, SDK, CLI tools); see the producer sketch after this list
Consumers: Ways to read from the stream (EC2 instances, Lambda functions)
Shards: The unit of throughput; used to manage a stream's capacity
Partition Key: Distributes records across shards
Retention Period: Data can persist from 24 hours up to 365 days
Common Terms: Producer, Shard, Partition Key, Sequence Number, Consumer, Retention Period
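As a concrete illustration of the producer, partition key, and sequence number terms above, here is a minimal SDK producer sketch using boto3; the stream name and record fields are hypothetical placeholders.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"sensor_id": "sensor-42", "temperature": 21.7}

    # The partition key determines which shard the record lands on;
    # records sharing a key keep their relative order within a shard.
    response = kinesis.put_record(
        StreamName="demo-data-stream",  # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["sensor_id"],
    )

    # Kinesis returns the shard ID and the per-shard sequence number
    # assigned to the stored record.
    print(response["ShardId"], response["SequenceNumber"])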
Lab Example
Set Up Kinesis Components: Firehose, data generators, Analytics, etc.
Test Anomalies: Real-time detection of anomalies with email notifications (a Lambda consumer sketch follows)
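The lab wires Kinesis output to downstream alerting. As a hedged illustration of the Lambda-consumer pattern mentioned under Architecture (not the exact lab code), a handler subscribed to a Kinesis stream might look like this; note that Kinesis delivers record payloads base64-encoded, and the "anomaly_score" field is a hypothetical example:

    import base64
    import json

    # Sketch of a Lambda function subscribed to a Kinesis stream.
    def handler(event, context):
        for rec in event["Records"]:
            # Kinesis record data arrives base64-encoded inside the event.
            payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
            # Hypothetical threshold check; in the lab, the anomaly score
            # itself is computed upstream by Kinesis Data Analytics.
            if payload.get("anomaly_score", 0) > 2.0:
                print(f"Anomaly detected: {payload}")  # e.g., publish to SNS for email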
AWS Data Migration Service (DMS)
Key Points
Definition: Helps migrate databases to AWS with minimal downtime
Supports: Multiple database sources and targets; minimal schema conversion
Architecture: Source, Target, Replication Instance, Tasks
Replication Instance: Runs and manages the replication tasks
Endpoints: Define the source and target
Common Terms: Replication Instance, Endpoints, Replication Tasks, Schema Conversion Tool
Lab Example
Setup: Create an RDS instance, then migrate it to DynamoDB using DMS
Steps (a boto3 sketch follows this list):
Create an RDS Postgres instance and load sample data
Set up a DMS Replication Instance and configure source/target endpoints
Use a DMS task to replicate and transform the data
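The lab performs these steps in the AWS console. As an assumption-laden sketch, the same endpoints and task could be created with boto3 roughly as follows; all identifiers, hostnames, credentials, and ARNs are hypothetical placeholders.

    import json
    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    # Source: the RDS Postgres instance created in the lab (placeholder values).
    source = dms.create_endpoint(
        EndpointIdentifier="postgres-source",
        EndpointType="source",
        EngineName="postgres",
        ServerName="mydb.abc123.us-east-1.rds.amazonaws.com",
        Port=5432,
        Username="dms_user",
        Password="example-password",
        DatabaseName="sampledb",
    )

    # Target: DynamoDB, which requires an IAM role DMS can assume (placeholder ARN).
    target = dms.create_endpoint(
        EndpointIdentifier="dynamodb-target",
        EndpointType="target",
        EngineName="dynamodb",
        DynamoDbSettings={"ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-dynamodb-role"},
    )

    # Full-load task copying every table in the public schema.
    dms.create_replication_task(
        ReplicationTaskIdentifier="postgres-to-dynamodb",
        SourceEndpointArn=source["Endpoint"]["EndpointArn"],
        TargetEndpointArn=target["Endpoint"]["EndpointArn"],
        ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",
        MigrationType="full-load",
        TableMappings=json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-public-schema",
                "object-locator": {"schema-name": "public", "table-name": "%"},
                "rule-action": "include",
            }]
        }),
    )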
AWS Glue
Key Points
Managed ETL Service: Extract, transform, and load data with minimal infrastructure management
Runs: Spark (PySpark or Scala) or plain Python jobs
Components:
Glue Data Catalog: Metadata repository managing table and job definitions
Crawlers: Automate data discovery and schema inference (see the sketch after this list)
Jobs: Define ETL scripts
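As a small, hedged example of driving a crawler programmatically, the following boto3 sketch creates and starts one; the crawler name, IAM role, database, and S3 path are hypothetical placeholders.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Create a crawler that scans an S3 prefix, infers schemas, and
    # registers the resulting tables in the Glue Data Catalog.
    glue.create_crawler(
        Name="raw-orders-crawler",
        Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # placeholder role
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
    )

    glue.start_crawler(Name="raw-orders-crawler")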
Architecture
Data Source
Data Transformation: ETL script
Data Target: S3, databases, etc.
Glue Data Catalog: Organizes data into databases and tables
Crawlers: Populate the catalog and define schemas across data sources and targets
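Tying the pieces together, a typical Glue PySpark job reads a catalog table, applies a transformation, and writes to a target. The skeleton below follows Glue's standard job boilerplate; the database, table, column names, and S3 path are hypothetical placeholders.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read a table that a crawler registered in the Glue Data Catalog
    # (hypothetical database and table names).
    source = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders")

    # Rename/retype columns as a simple transformation step.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("order_id", "string", "order_id", "string"),
            ("amount", "string", "amount", "double"),
        ])

    # Write the result to a hypothetical S3 target as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/processed/orders/"},
        format="parquet")

    job.commit()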