Spark Memory Management Overview

Overview

This lecture explains memory management in Apache Spark, focusing on executor memory layout, unified memory, and off-heap memory, including how memory is allocated and managed for optimal performance.

Executor Memory Management

Spark executor memory has three main sections: on-heap memory, off-heap memory, and overhead memory.
On-heap memory is managed by the Java Virtual Machine (JVM) and is the main area for Spark operations.
The executor memory (e.g., 10 GB) is divided using parameters like spark.memory.fraction and spark.memory.storageFraction.
Overhead memory is used for internal system operations and is set as the larger of 384 MB or 10% of executor memory.
Off-heap memory is not included by default and must be explicitly enabled and allocated.

On-Heap Memory Breakdown

On-heap memory is split into execution memory (for joins, shuffles, sorts, aggregations), storage memory (for caching RDD/DataFrames and broadcast variables), user memory (for user objects and UDFs), and reserved memory (for Spark internals).
User memory and reserved memory take up the space not used by execution and storage.

Unified Memory

Unified memory refers to the combined execution and storage memory in the on-heap area.
Spark's dynamic memory management allows execution and storage memory to borrow space from each other based on demand.
Execution memory has priority, and storage memory will evict cached blocks using the Least Recently Used (LRU) algorithm if more space is needed for execution.

Off-Heap Memory

Off-heap memory is managed by the operating system, not the JVM, and avoids Java garbage collection pauses.
Developers must manually handle allocation and deallocation to prevent memory leaks.
Off-heap memory is slower than on-heap but faster than writing to disk; suitable for specific optimization needs.
To enable, set spark.memory.offHeap.enabled to true and allocate size (e.g., 10-20% of executor memory).

Key Terms & Definitions

On-heap memory — JVM-managed memory for core Spark operations.
Off-heap memory — OS-managed memory, outside the JVM, used to reduce garbage collection overhead.
Overhead memory — Memory used for system-level Spark executor processes.
Execution memory — On-heap memory for operations like joins, shuffles, and aggregations.
Storage memory — On-heap memory for caching and broadcast variables.
Unified memory — The combined execution and storage memory that can dynamically adjust allocation.
Garbage Collection (GC) — JVM process that frees unused objects to reclaim memory.
LRU algorithm — Least Recently Used eviction policy for cached data.

Action Items / Next Steps

Review Spark memory configurations: spark.executor.memory, spark.memory.fraction, spark.memory.storageFraction, and spark.memory.offHeap.enabled.
Experiment with allocating off-heap memory for advanced performance tuning.
Evaluate if your caching strategies are necessary to optimize unified memory usage.