🧠

Spark Memory Management Overview

Jul 10, 2025

Overview

This lecture explains memory management in Apache Spark, focusing on executor memory layout, unified memory, and off-heap memory, including how memory is allocated and managed for optimal performance.

Executor Memory Management

  • Spark executor memory has three main sections: on-heap memory, off-heap memory, and overhead memory.
  • On-heap memory is managed by the Java Virtual Machine (JVM) and is the main area for Spark operations.
  • The executor memory (e.g., 10 GB) is divided using parameters like spark.memory.fraction and spark.memory.storageFraction.
  • Overhead memory is used for internal system operations and is set as the larger of 384 MB or 10% of executor memory.
  • Off-heap memory is not included by default and must be explicitly enabled and allocated.

On-Heap Memory Breakdown

  • On-heap memory is split into execution memory (for joins, shuffles, sorts, aggregations), storage memory (for caching RDD/DataFrames and broadcast variables), user memory (for user objects and UDFs), and reserved memory (for Spark internals).
  • User memory and reserved memory take up the space not used by execution and storage.

Unified Memory

  • Unified memory refers to the combined execution and storage memory in the on-heap area.
  • Spark's dynamic memory management allows execution and storage memory to borrow space from each other based on demand.
  • Execution memory has priority, and storage memory will evict cached blocks using the Least Recently Used (LRU) algorithm if more space is needed for execution.

Off-Heap Memory

  • Off-heap memory is managed by the operating system, not the JVM, and avoids Java garbage collection pauses.
  • Developers must manually handle allocation and deallocation to prevent memory leaks.
  • Off-heap memory is slower than on-heap but faster than writing to disk; suitable for specific optimization needs.
  • To enable, set spark.memory.offHeap.enabled to true and allocate size (e.g., 10-20% of executor memory).

Key Terms & Definitions

  • On-heap memory — JVM-managed memory for core Spark operations.
  • Off-heap memory — OS-managed memory, outside the JVM, used to reduce garbage collection overhead.
  • Overhead memory — Memory used for system-level Spark executor processes.
  • Execution memory — On-heap memory for operations like joins, shuffles, and aggregations.
  • Storage memory — On-heap memory for caching and broadcast variables.
  • Unified memory — The combined execution and storage memory that can dynamically adjust allocation.
  • Garbage Collection (GC) — JVM process that frees unused objects to reclaim memory.
  • LRU algorithm — Least Recently Used eviction policy for cached data.

Action Items / Next Steps

  • Review Spark memory configurations: spark.executor.memory, spark.memory.fraction, spark.memory.storageFraction, and spark.memory.offHeap.enabled.
  • Experiment with allocating off-heap memory for advanced performance tuning.
  • Evaluate if your caching strategies are necessary to optimize unified memory usage.