Introduction to Apache Spark

Aug 22, 2024

Spark Overview Lecture Notes

Introduction

  • Data volumes are growing every second, making efficient processing essential for extracting business insights.
  • Apache Spark is a key engine for processing big data, including real-time streams.

Course Agenda

  1. Introduction to Spark and its ecosystem.
  2. Understanding RDDs in Spark.
  3. Spark fundamentals.
  4. Spark transformations and actions.
  5. Working with Spark SQL.
  6. Reading data from various sources.
  7. Spark Streaming.

Key Concepts About Apache Spark

  • History of Spark:

    • Developed in 2009 at UC Berkeley as part of the AMPLab project.
    • Designed as a fast, general-purpose cluster computing engine; it can run on cluster managers such as YARN and Mesos.
    • Donated to the Apache Software Foundation in 2013, became a top-level project in 2014, and gained wide traction by 2015.
    • Major improvements landed across the 1.x and 2.x release lines, with 2.4.x considered the most stable release at the time of the lecture.
  • RDDs (Resilient Distributed Datasets):

    • Fundamental data structure in Spark, enabling distributed data processing.
    • Supports fault tolerance and parallel processing.
  • Spark SQL:

    • Allows querying structured data via SQL.
    • DataFrames and Datasets provide optimized performance for queries.
  • RDD vs DataFrames vs Datasets (the sketch after this list contrasts all three):

    • RDD: Unstructured data, less optimization, used for low-level data manipulation.
    • DataFrame: Optimized, allows complex queries, similar to tables in databases.
    • Dataset: Type-safe, combines the advantages of RDDs and DataFrames.
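
To make the contrast concrete, here is a minimal Scala sketch, assuming a local Spark 2.x session; the app name, `Person` case class, and sample rows are illustrative, not from the lecture. It expresses the same age filter against an RDD, a DataFrame, and a Dataset:

```scala
import org.apache.spark.sql.SparkSession

object ApiComparison {
  // Illustrative record type for the typed Dataset API
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-df-ds-demo") // hypothetical app name
      .master("local[*]")        // local mode, for the demo only
      .getOrCreate()
    import spark.implicits._

    // RDD: no schema, no Catalyst optimization, full low-level control
    val rdd = spark.sparkContext.parallelize(Seq(("Alice", 34L), ("Bob", 16L)))
    val adultsRdd = rdd.filter { case (_, age) => age >= 18 }

    // DataFrame: rows with a named schema, optimized like a database table
    val df = rdd.toDF("name", "age")
    val adultsDf = df.filter($"age" >= 18)

    // Dataset: the same optimizations plus compile-time type safety
    val ds = df.as[Person]
    val adultsDs = ds.filter(_.age >= 18)

    println(adultsRdd.count()) // 1
    adultsDf.show()
    adultsDs.show()
    spark.stop()
  }
}
```

The DataFrame and Dataset filters go through Spark's Catalyst optimizer, while the RDD filter runs the supplied closure as-is, which is why RDDs offer less optimization but more low-level control.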

Working with Spark

  • To get started with Spark, it is necessary to understand RDDs, transformations, and actions.

  • Transformations: Lazy functions that produce a new dataset from an existing one (e.g., map, filter); they build a lineage but do not execute anything on their own.

  • Actions: Functions that trigger execution and return a value to the driver program (e.g., collect, count). The first sketch after this list shows the transformation/action split.

  • Spark SQL Commands:

    • Create DataFrame: `spark.read.json("path/to/file.json")` (the path is illustrative; the second sketch below walks through the full read/view/query flow).
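
The transformation/action split above is easiest to see in code. Here is a minimal sketch, again assuming a local Spark session, with a made-up app name and numbers: `map` and `filter` only record lineage, and nothing executes until `count` or `collect` is called.

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transformations-vs-actions") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 10)

    // Transformations: lazily define new RDDs; no work happens yet
    val doubled = numbers.map(_ * 2)
    val big     = doubled.filter(_ > 10)

    // Actions: trigger execution and return results to the driver
    println(big.count())                 // 5
    println(big.collect().mkString(",")) // 12,14,16,18,20

    spark.stop()
  }
}
```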
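
The `spark.read.json` command above is the first step of the usual Spark SQL flow. A hedged sketch of the rest, assuming an illustrative file path, view name, and columns that are not from the lecture: read a JSON file into a DataFrame, register it as a temporary view, and query it with plain SQL.

```scala
import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-demo") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Create a DataFrame from a JSON file (illustrative path)
    val people = spark.read.json("data/people.json")

    // Expose it to SQL under a temporary view name
    people.createOrReplaceTempView("people")

    // Query structured data with plain SQL
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    spark.stop()
  }
}
```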