Databricks and Apache Spark Performance Tuning
Lesson 2: The Low Hanging Fruit
Overview
- Presenter: Bryan Cafferky
- Series: Databricks & Apache Spark Performance Tuning
- Lesson Focus: Quick, high-impact fixes: getting compute resources right at the start of a project.
Topics Covered
- Why compute resources matter.
- Databricks workspace: Standard vs Premium.
- Databricks and Apache Spark cluster architecture.
- Hardware considerations underlying the cluster architecture.
- Step-by-step guide to optimizing cluster configuration.
- Focus on shuffles and spills.
Importance of Compute Resources
- Cloud Performance Tuning: Often overlooked, but crucial for optimizing workloads.
- Cost vs Performance: With legacy on-premises SQL servers you could throw hardware at a problem; with Databricks, the goal is finding the balance that delivers the best performance at the lowest cost.
- Initial Setup: Getting resource allocation right initially is the most effective way to boost performance.
- Interdependencies: Performance optimization and compute resource selection are interconnected.
Databricks Workspace: Standard vs Premium
- Recommendation: Use Premium for better features and performance, especially for role-based access controls and Unity Catalog.
- Unity Catalog & Delta Live Tables: Both require Premium tier.
- Photon: Does not require Premium tier but offers excellent performance improvements.
Cluster Architecture
- Driver Node: Coordinates all the work, partitions data, distributes tasks to worker nodes.
- Worker Nodes: Execute tasks and hold the data partitions they process.
- Cores: Determine the level of parallelism; each core runs one task at a time (see the sketch after this list).
- Memory: Split between storage memory (caching) and execution (working) memory.
- Disk Speed: Important for performance; SSDs are faster.
- Shuffles and Spills: Costly operations that occur during data rearrangement (e.g., joins, aggregations).
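The core-to-parallelism relationship is easy to see from a notebook. A minimal PySpark sketch (the row count and repartition multiplier are illustrative assumptions, not values from the lesson):

```python
from pyspark.sql import SparkSession

# In a Databricks notebook, `spark` already exists; getOrCreate()
# simply returns the existing session.
spark = SparkSession.builder.getOrCreate()

# Total cores across the cluster: the ceiling on tasks that can
# run at the same time (one task per core).
total_cores = spark.sparkContext.defaultParallelism
print(f"Cluster can run {total_cores} tasks concurrently")

# Illustrative: if a DataFrame has fewer partitions than cores,
# some cores sit idle during that stage.
df = spark.range(100_000_000)
print(f"Partitions: {df.rdd.getNumPartitions()}")

# A common rule of thumb is 2-4x the core count in partitions.
df = df.repartition(total_cores * 2)
```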
Hardware Considerations
- CPU vs GPU: The choice depends on the workload (CPUs for data transformations, GPUs for machine learning).
- Memory: Must be sufficient to avoid spills and out-of-memory failures (a rough sizing sketch follows this list).
- Network: Latency and bandwidth can affect performance, especially when using external storage.
- Storage: Fast, local storage (e.g., SSDs) beats slower remote disks.
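A back-of-envelope way to reason about the memory bullet above; every figure in this sketch is an assumption for illustration, not a number from the lesson:

```python
# Memory-per-task estimate; all values below are assumed examples.
executor_memory_gb = 28    # usable heap on a hypothetical 32 GB worker VM
cores_per_executor = 8     # tasks that can run concurrently per worker
memory_fraction = 0.6      # Spark's default spark.memory.fraction

per_task_gb = executor_memory_gb * memory_fraction / cores_per_executor
print(f"~{per_task_gb:.1f} GB of execution+storage memory per concurrent task")
# If a single shuffle partition needs more than this, expect spills to disk.
```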
Optimizing Cluster Configuration
- Cluster Creation: Walkthrough of creating a cluster in the Azure Databricks workspace and selecting compute resources.
- Workload Types: Different recommendations for analysis, ETL, and machine learning.
- Node Types: Fewer, larger nodes are generally better than many small nodes, since more shuffle traffic stays within a node instead of crossing the network.
- Spot Instances: Cost-effective but can be evicted mid-job, making them less reliable (see the example spec after this list).
- Serverless Compute: Clusters are available almost instantly, eliminating start-up waits, though the pricing model differs from classic compute.
- Photon: Vastly improves performance by replacing JVM-based execution with a native, vectorized C++ engine.
- Access Modes: Single user, shared, or no isolation shared.
- Worker and Driver Configuration: Type and size of VMs can be optimized based on workload.
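Several of these choices meet in the cluster definition itself. Below is a hedged sketch of a Databricks Clusters API payload written as a Python dict; the node types, runtime version, and worker counts are example values, not recommendations from the lesson:

```python
# Example Databricks cluster spec (Clusters API payload). All
# concrete values here are illustrative assumptions.
cluster_spec = {
    "cluster_name": "etl-tuned-example",
    "spark_version": "14.3.x-scala2.12",   # an LTS runtime (example)
    "node_type_id": "Standard_E8ds_v4",    # memory-optimized workers (example)
    "driver_node_type_id": "Standard_E8ds_v4",
    "autoscale": {"min_workers": 2, "max_workers": 6},
    "runtime_engine": "PHOTON",            # enable the Photon engine
    "data_security_mode": "SINGLE_USER",   # access mode
    "azure_attributes": {
        # Spot VMs with automatic fallback to on-demand on eviction.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # Keep the first node (the driver) on-demand for reliability.
        "first_on_demand": 1,
    },
}
```

Keeping first_on_demand at 1 means a spot eviction can only take out a worker, whose tasks Spark retries, and never the driver, which would kill the whole job.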
Shuffles and Spills
- Spills: Occur when a worker runs out of execution memory and data is written to disk, which is far slower than memory.
- Monitoring: Use the Spark UI to identify shuffles and spills (the Stages tab shows shuffle read/write and spill metrics per stage).
- Adaptive Query Execution (AQE): Automatically re-optimizes query plans at runtime using statistics from completed stages (config sketch below).
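A short sketch of the relevant Spark SQL settings. AQE is enabled by default on recent Spark and Databricks runtimes, so setting these explicitly is mostly illustrative; the 64MB threshold is an assumed example value:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided as `spark` on Databricks

# Let Spark re-plan queries at runtime from observed stage statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Merge undersized shuffle partitions after a shuffle.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split skewed shuffle partitions so one huge partition cannot
# stall (or spill on) a single task.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Allow AQE to switch a sort-merge join to a broadcast join when a
# side turns out to be small at runtime (64MB is an example value).
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "64MB")
```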
Summary
- Compute Resources are Critical: The foundation of performance optimization.
- Premium Workspace: Generally recommended for better features and performance.
- Cluster Architecture: Understanding the architecture helps in making informed choices about resource allocation.
- Step-by-Step Optimization: Focus on node type, size, and configuration to match workloads.
- Shuffles and Spills: Key areas where performance hits occur; monitoring and optimization are crucial.
Conclusion
- Call to Action: Like, share, and subscribe.
- Personal Note: Thank you for watching and supporting.
Please refer to the video's description for related links and additional resources.