Exploring Apache Hadoop Ozone Architecture

Aug 22, 2024


Notes on Apache Hadoop Ozone

Overview of Apache Hadoop Ozone

Ozone started as a sub-project of Hadoop and is now a top-level Apache project (Apache Ozone).
It's a distributed object store that can scale to store billions of objects.
It provides an S3-compatible protocol, a Hadoop File System interface, and a CSI (Container Storage Interface) driver.

Storage Approaches

Common method: Split files into smaller blocks (similar to HDFS).
Advantages:
- Easier to replicate just the blocks between data nodes.
- More efficient erasure coding.
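The splitting step can be sketched in a few lines of Python. This is only an illustration of the idea, not Ozone's or HDFS's actual code; the function name `split_into_blocks` and the tiny block size are assumptions chosen for readability.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# A 10-byte stand-in for a file; real systems use block sizes in the megabytes.
file_bytes = b"abcdefghij"
blocks = split_into_blocks(file_bytes, block_size=4)

# Each block can now be replicated or erasure-coded independently.
print([len(b) for b in blocks])  # [4, 4, 2]
```

Concatenating the blocks in order gives back the original file, which is why only the block list needs to be remembered per file.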

Block Replication

Each block needs multiple replicas on different data nodes to prevent data loss.
Example:
- Block 1 should be on Data Node 1, Data Node 3, etc.
- Block replication management ensures availability.
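A minimal sketch of what replication management has to track, assuming a replication factor of 3 and a simple round-robin placement. Both are assumptions for illustration; the names `place_replicas` and `under_replicated` are hypothetical and do not come from Ozone.

```python
REPLICATION_FACTOR = 3  # assumed value, for illustration only


def place_replicas(block_id: int, nodes: list[str],
                   factor: int = REPLICATION_FACTOR) -> list[str]:
    """Pick `factor` distinct nodes for a block using round-robin placement."""
    if factor > len(nodes):
        raise ValueError("not enough data nodes for the requested factor")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(factor)]


def under_replicated(placement: dict[int, list[str]], live_nodes: set[str],
                     factor: int = REPLICATION_FACTOR) -> list[int]:
    """Return block IDs that have fewer than `factor` replicas on live nodes."""
    return [block for block, locations in placement.items()
            if len(set(locations) & live_nodes) < factor]


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = {block: place_replicas(block, nodes) for block in range(4)}

# Simulate dn4 failing: every block that had a replica there needs a new copy.
needs_repair = under_replicated(placement, live_nodes={"dn1", "dn2", "dn3"})
print(needs_repair)  # [1, 2, 3]
```

Block 0 lives on dn1, dn2, and dn3, so it is unaffected by the failure; the other three blocks each lost one replica.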

Mapping Structures

Key Space Mapping: Mapping files to blocks.
Block Space Mapping: Mapping blocks to their storage locations.
In HDFS, a single master node (NameNode) manages both mappings.
In Ozone, this functionality is split across two master servers:
1. Key-to-block mapping service (the Ozone Manager)
2. Replication management service (the Storage Container Manager)
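The two mappings can be pictured as two separate lookup tables, each owned by its own service. The sketch below uses plain dictionaries; the class names, key path, and block IDs are made up for illustration and are not Ozone APIs.

```python
class KeySpaceService:
    """Key space mapping: object key -> ordered list of block IDs."""

    def __init__(self) -> None:
        self.key_to_blocks: dict[str, list[int]] = {}

    def blocks_for(self, key: str) -> list[int]:
        return self.key_to_blocks[key]


class BlockSpaceService:
    """Block space mapping: block ID -> data nodes holding a replica."""

    def __init__(self) -> None:
        self.block_to_nodes: dict[int, list[str]] = {}

    def locations_of(self, block_id: int) -> list[str]:
        return self.block_to_nodes[block_id]


def locate(key: str, keys: KeySpaceService,
           blocks: BlockSpaceService) -> list[tuple[int, list[str]]]:
    """Resolve a key to its blocks, then each block to its storage locations."""
    return [(b, blocks.locations_of(b)) for b in keys.blocks_for(key)]


key_service = KeySpaceService()
block_service = BlockSpaceService()
key_service.key_to_blocks["/vol1/bucket1/report.csv"] = [101, 102]
block_service.block_to_nodes[101] = ["dn1", "dn3"]
block_service.block_to_nodes[102] = ["dn2", "dn3"]

print(locate("/vol1/bucket1/report.csv", key_service, block_service))
```

Keeping the tables in separate services means the key space and the block space can grow and be scaled independently, which is the motivation for the split described above.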

Ozone Components

Core components include:
- Storage Container Manager (SCM) - for block replication.
- Ozone Manager - for key space management.
- Data Nodes - responsible for storing data.
Additional components: Web UI, prediction service, S3-compatible REST service.

Storage Container Manager (SCM)

Responsible for replicating data.
Creates replication pipelines between Data Nodes over the network.
Data Nodes send heartbeats to SCM to report their status.
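The heartbeat idea can be sketched as a table of last-seen timestamps with a timeout. This is a simplified model, not SCM's implementation; the `HeartbeatTracker` class and the 30-second timeout are assumptions. Timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatTracker:
    """Track the last heartbeat time of each data node."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def record(self, node_id: str, now: float) -> None:
        """Store the time of the most recent heartbeat from a node."""
        self.last_seen[node_id] = now

    def stale_nodes(self, now: float) -> list[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


tracker = HeartbeatTracker(timeout_s=30)  # assumed timeout
tracker.record("dn1", now=0)
tracker.record("dn2", now=0)
tracker.record("dn1", now=40)  # dn1 keeps reporting, dn2 goes silent

stale = tracker.stale_nodes(now=45)
print(stale)  # ['dn2']
```

A node that shows up as stale is a signal that the blocks it held may need new replicas elsewhere.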

Ozone Manager

Manages volumes, buckets, keys, and provides indexing for file system clients.
Ensures integration with the lower-level replication layer (SCM).
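The volume, bucket, and key hierarchy can be modelled as nested dictionaries, with prefix listing standing in for the indexing that file system clients rely on. This is a toy model under those assumptions; the `Namespace` class and its methods are hypothetical, not the Ozone Manager API.

```python
class Namespace:
    """Toy namespace: volume -> bucket -> key -> list of block IDs."""

    def __init__(self) -> None:
        self._volumes: dict[str, dict[str, dict[str, list[int]]]] = {}

    def put_key(self, volume: str, bucket: str, key: str,
                block_ids: list[int]) -> None:
        """Create or overwrite a key, creating the volume and bucket if needed."""
        buckets = self._volumes.setdefault(volume, {})
        buckets.setdefault(bucket, {})[key] = list(block_ids)

    def list_keys(self, volume: str, bucket: str, prefix: str = "") -> list[str]:
        """List keys in a bucket that start with `prefix`, in sorted order."""
        keys = self._volumes.get(volume, {}).get(bucket, {})
        return sorted(k for k in keys if k.startswith(prefix))


ns = Namespace()
ns.put_key("vol1", "bucket1", "logs/a.txt", [1])
ns.put_key("vol1", "bucket1", "logs/b.txt", [2, 3])
ns.put_key("vol1", "bucket1", "data/c.bin", [4])

# Listing by prefix behaves like listing a directory for a file system client.
print(ns.list_keys("vol1", "bucket1", prefix="logs/"))  # ['logs/a.txt', 'logs/b.txt']
```

The block IDs stored per key are where this layer hands off to the replication layer, which knows where each block physically lives.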

Data Nodes

Store the actual block data, much like storage nodes in other distributed storage services.
Report the status of the data they hold to SCM.

Future Topics

Upcoming discussions will cover the Storage Container Manager in more detail, along with how the large replicated binary objects are managed.
