Overview
This lecture explains memory management in Apache Spark, focusing on executor memory layout, unified memory, and off-heap memory, including how memory is allocated and managed for optimal performance.
Executor Memory Management
- Spark executor memory has three main sections: on-heap memory, off-heap memory, and overhead memory.
- On-heap memory is managed by the Java Virtual Machine (JVM) and is the main area for Spark operations.
- The executor memory (e.g., 10 GB) is divided using parameters like
spark.memory.fraction and spark.memory.storageFraction.
- Overhead memory is used for internal system operations and is set as the larger of 384 MB or 10% of executor memory.
- Off-heap memory is not included by default and must be explicitly enabled and allocated.
On-Heap Memory Breakdown
- On-heap memory is split into execution memory (for joins, shuffles, sorts, aggregations), storage memory (for caching RDD/DataFrames and broadcast variables), user memory (for user objects and UDFs), and reserved memory (for Spark internals).
- User memory and reserved memory take up the space not used by execution and storage.
Unified Memory
- Unified memory refers to the combined execution and storage memory in the on-heap area.
- Spark's dynamic memory management allows execution and storage memory to borrow space from each other based on demand.
- Execution memory has priority, and storage memory will evict cached blocks using the Least Recently Used (LRU) algorithm if more space is needed for execution.
Off-Heap Memory
- Off-heap memory is managed by the operating system, not the JVM, and avoids Java garbage collection pauses.
- Developers must manually handle allocation and deallocation to prevent memory leaks.
- Off-heap memory is slower than on-heap but faster than writing to disk; suitable for specific optimization needs.
- To enable, set
spark.memory.offHeap.enabled to true and allocate size (e.g., 10-20% of executor memory).
Key Terms & Definitions
- On-heap memory — JVM-managed memory for core Spark operations.
- Off-heap memory — OS-managed memory, outside the JVM, used to reduce garbage collection overhead.
- Overhead memory — Memory used for system-level Spark executor processes.
- Execution memory — On-heap memory for operations like joins, shuffles, and aggregations.
- Storage memory — On-heap memory for caching and broadcast variables.
- Unified memory — The combined execution and storage memory that can dynamically adjust allocation.
- Garbage Collection (GC) — JVM process that frees unused objects to reclaim memory.
- LRU algorithm — Least Recently Used eviction policy for cached data.
Action Items / Next Steps
- Review Spark memory configurations:
spark.executor.memory, spark.memory.fraction, spark.memory.storageFraction, and spark.memory.offHeap.enabled.
- Experiment with allocating off-heap memory for advanced performance tuning.
- Evaluate if your caching strategies are necessary to optimize unified memory usage.