Notes from Spark Training, Part 1

Marc Deveaux
10 min read · Oct 11, 2023


Teaching done by Moise P; content comes from the slides

Big data 5V

  • Volume: refers to huge amount of data (petabytes and exabytes)
  • Velocity: captures the speed at which new data is generated, processed, and made available
  • Variety: structured, semi-structured, and unstructured types of data
  • Veracity: trustworthiness of the data
  • Value: refers to our ability to turn our data into value

Hadoop

  • Open-source, Java-based framework used to efficiently store and process large datasets. It runs on clusters of computers
  • It is a distributed system, meaning it exploits parallel computation and data locality
  • Data is split into blocks that are distributed across a cluster of many servers
  • Every server processes only the data stored in its local storage
  • You combine results from local data processing to produce the final output

Vocabulary

  • A cluster is a group of computers working together: it provides data storage, data processing, and resource management
  • A node is an individual computer in the cluster. The Master nodes manage distribution of work and data to worker nodes
  • A daemon is a program running on a node: each Hadoop daemon performs a specific function in the cluster

Hadoop Architecture

  • HDFS: a distributed file system that provides high-throughput access to application data
  • Ozone: “redundant, distributed object store optimized for Big data workloads. The primary design point of ozone is scalability, and it aims to scale to billions of objects.” (Apache Ozone website)
  • YARN: a general-purpose job scheduler and resource manager
  • MapReduce: a YARN-based system for distributed batch processing

The Hadoop architecture is flexible: some parts can easily be replaced with third-party components (e.g., Ozone instead of HDFS), and additional software packages such as Impala or Flume can be installed on top of it.

HDFS

  • Used to store massive volumes of data. It prefers millions of 100+ MB files rather than billions of 10 KB files
  • Runs on clusters of many commodity servers
  • “write once, read many times” principle (WORM): data storage device in which information, once written, cannot be modified.
  • Reading the complete data as fast as possible is more important than fetching a single record
  • The goal of HDFS is to collect very large files and deliver them to a large number of clients while tolerating failures coming from using commodity hardware

HDFS is bad at:

  • handling a lot of small files: using HDFS to process many small files is a poor choice
  • supporting fast random writes (HDFS is optimized for append-only operations)
  • providing low-latency data access; HDFS is optimized for high throughput

Concept:

  • Files are split into fixed size chunks of B MB (B = 128 by default) called blocks. A block is the minimum amount of data readable/writable by
    HDFS
  • Blocks are stored and replicated onto a cluster of DataNodes
  • A NameNode stores metadata (keeps track of how files are broken down into blocks and where they are stored)
  • A HDFS Client talks to the NameNode for metadata related activities and to DataNodes for reading and writing files

NameNode

  • It stores metadata information (e.g., file system tree) and the mapping of file blocks to DataNodes. Any change to the file system namespace or
    its properties is recorded by the NameNode
  • It does NOT directly write or read data
  • To perform an I/O operation, a client must first talk to the NameNode to know the block locations, to update file system namespace, …
  • It is the most critical component of HDFS: The NameNode daemon must be running at all times and if the NameNode stops, the HDFS cluster becomes inaccessible

DataNode

  • It stores and retrieves file blocks when it is told to (by clients or the NameNode)
  • It sends heartbeat messages to the NameNode periodically (every 3 seconds by default); every message contains the DataNode’s identity and the list of blocks it is storing. The NameNode uses this information to confirm that the DataNode is alive
  • DataNodes hold several replicas of the different blocks (in case one fails)

Yet Another Resource Negotiator (YARN)

  • YARN is Hadoop’s distributed cluster resource manager and job scheduler
  • The core idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons
  • YARN allows multiple data processing engines to run on a single Hadoop cluster. E.g., batch programs (e.g. Spark, MapReduce), interactive SQL (e.g. Impala), advanced analytics (e.g. Spark, Impala), streaming (e.g. Spark Streaming)
  • YARN clusters can expand to thousands of nodes managing petabytes of data
  • There is a Hadoop YARN Web UI, which is the entry point for inspecting currently running Spark processes

Core components

  • Container: basic unit of work in YARN; it allocates resources (RAM, CPU, disk, …) of a single node to execute an application-specific process
  • Resource Manager: “the ultimate authority that arbitrates resources among all the applications in the system” (Apache Hadoop website)
  • NodeManager: “responsible for launching and managing containers on a node. Containers execute tasks as specified by the AppMaster.”
    (Apache Hadoop website)
  • ApplicationMaster: “process that coordinates the execution of an application in the cluster. Each application has its own unique Application Master that is tasked with negotiating resources (Containers) from the Resource Manager and working with the Node Managers to execute and monitor the tasks.” (Cloudera documentation)
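
As a rough illustration of how these pieces relate to a Spark job, the container sizes that the ApplicationMaster negotiates with the ResourceManager are usually expressed as configuration when the application starts. Below is a minimal PySpark sketch, assuming a cluster where YARN is the configured cluster manager; the instance, memory, and core values are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical sizing: each executor container asks YARN for 4 GB of RAM and
# 2 cores. The ApplicationMaster negotiates these containers with the
# ResourceManager, and the NodeManagers launch them on worker nodes.
spark = (
    SparkSession.builder
    .appName("yarn-sizing-sketch")
    .master("yarn")                           # assumes YARN is reachable
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```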

MapReduce

  • Programming model and a batch-based, distributed computing framework for processing large data sets on large clusters in a reliable way
  • Inspired by (but not identical to) the functional operations map and reduce. Map applies a function to individual values to create a new list of values (mapping square over 1 2 3 4 gives 1 4 9 16). Reduce combines input values to create a single value (reducing + over 1 4 9 16 gives 30); see the short sketch after this list
  • MapReduce simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as Input partitioning, Computational parallelization, Work distribution or Dealing with unreliable software / hardware
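
To make the map/reduce analogy above concrete, here is a plain-Python sketch of the two functional operations the framework is named after (this is not Hadoop MapReduce itself, just the idea behind it):

```python
from functools import reduce
import operator

values = [1, 2, 3, 4]

# "map square 1 2 3 4" -> 1 4 9 16
squares = list(map(lambda x: x * x, values))

# "reduce + 1 4 9 16" -> 30
total = reduce(operator.add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```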

Hive

  • A data warehouse platform built on top of Hadoop that lets you read, write, and manage large datasets residing in distributed storage using an SQL-like language called HiveQL
  • HiveQL statements can be executed via Spark or MapReduce (see the sketch after this list)
  • Uses schema-on-read approach: “ Schema on read differs from schema on write because you create the schema only when reading the data”
    (Dell website)
  • Data can be stored in HDFS
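
As a small sketch of how Hive tables can be queried from Spark, the snippet below assumes a reachable Hive metastore; the database and table names (sales.orders) are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore,
# so HiveQL-style statements can be executed by Spark.
spark = (
    SparkSession.builder
    .appName("hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.sql(
    "SELECT customer_id, COUNT(*) AS n_orders "
    "FROM sales.orders GROUP BY customer_id"
)
df.show(5)
```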

Spark

Background

  • MapReduce has several limitations (many problems aren’t easily described as map-reduce) and it can be hard to program in directly. For example, MapReduce has no native support for applications that must reuse a working set of data across multiple parallel computations. This means that when a function is repeatedly applied to the same dataset (as in a machine learning algorithm when you optimize a parameter), each MapReduce job must reload the data from persistent storage (e.g., HDFS), incurring a significant performance penalty
  • Apache Spark was developed in response to the limitations of MapReduce and is a distributed computing system designed for big data processing and analytics. Known for its speed and ease of use, Spark can run programs up to 100x faster than Hadoop MapReduce in memory
  • Unlike MapReduce, which supports only a map stage and a reduce stage, Spark supports a DAG (directed acyclic graph) with many stages, such as Stage 1 (parallelize, filter, map); Stage 2 (reduceByKey, map); Stage 3 … (see the sketch after this list)
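
A minimal PySpark sketch of such a multi-stage DAG, assuming a local SparkSession (the word list is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = ["spark", "hadoop", "spark", "yarn", "hdfs", "spark"]

# Stage 1: parallelize, filter, map
pairs = (sc.parallelize(words)
           .filter(lambda w: len(w) > 4)     # keep "spark" and "hadoop"
           .map(lambda w: (w, 1)))

# Stage 2: reduceByKey triggers a shuffle, i.e. a new stage in the DAG
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1)]
```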

Advantages

  • Unified Platform: Spark integrates batch processing, real-time processing, machine learning, graph processing, and interactive query analysis.
  • In-Memory Computation: it has the ability to process data in-memory, reducing the need for time-consuming disk I/O operations
  • Compute Cluster: Spark can run on a distributed cluster of machines, spreading data and computations to process data in parallel
  • Fault Tolerance: Despite its in-memory nature, Spark ensures data reliability and fault tolerance through lineage information. Even if a task fails, Spark can recompute lost data using the original transformations, ensuring that no data is permanently lost.
  • Language Flexibility: Spark supports multiple programming languages (R, Python, Scala, …)
  • Integration: Spark can easily integrate with popular big data tools and frameworks (HDFS, Cassandra, HBase, and Amazon S3 for data storage,
    as well as Hadoop’s YARN and Kubernetes for cluster management)

Spark Components

  • Spark Core: The foundational engine for data distribution and I/O capabilities. It provides essential functionalities like task scheduling, memory management, and fault recovery.
  • Spark SQL: Enables querying data via SQL as well as a DataFrame API. It supports various data sources, including Hive, Avro, Parquet, ORC,
    JSON, JDBC, and more.
  • Spark Streaming: Allows for real-time data processing by ingesting data in mini-batches
  • Spark MLlib: Machine learning library
  • Spark GraphX: For graph computation and graph-parallel computation
  • Cluster Manager: Spark supports standalone native clustering, Apache Mesos, and Hadoop YARN as cluster managers

Architecture

  • Driver Program: Central control node that runs the main function (your application code) and creates the SparkContext. It is responsible for executing the application on the cluster and coordinating tasks
  • Executor: A distributed agent responsible for executing tasks. Each worker node in the cluster can run multiple executors
  • Task: A unit of work sent by the driver to the executor
  • Cluster Manager: An external service responsible for acquiring resources on the cluster. As mentioned, Spark supports standalone, Mesos, and YARN.

SparkContext is the entry point to all of the components of Apache Spark. A Spark application is initiated as soon as the SparkContext is created.

The distributed components in a Spark cluster (Executor and Cluster Manager) are accessed by the Driver through a SparkSession: it is an entry point to the underlying Spark functionality used to programmatically create Spark RDDs, DataFrames, and Datasets. “The SparkContext is a singleton and can only be created once in a Spark application. The SparkSession, on the other hand, can be created multiple times within an application” (Spark code hub)
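
A minimal sketch of this entry point in PySpark, run locally for illustration (replace the master URL with "yarn" or another cluster manager on a real cluster):

```python
from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already active, which
# is why a SparkSession can be "created" many times while the underlying
# SparkContext remains a singleton.
spark = (
    SparkSession.builder
    .appName("architecture-sketch")
    .master("local[*]")
    .getOrCreate()
)

sc = spark.sparkContext        # the single SparkContext behind the session
print(sc.applicationId)        # identifier of the running application
print(sc.master)               # which master / cluster manager is used
```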

Deployment mode

Storage and partitioning

  • Actual physical data is distributed across storage as partitions residing in either HDFS or cloud storage. While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction — as a DataFrame in memory
  • Each Spark executor is preferably allocated a task that requires it to read the partition closest to it in the network, observing data locality (illustrated in the sketch after this list)
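
A small sketch of inspecting that partitioning from PySpark, assuming the SparkSession spark from the earlier sketch; the HDFS path is a hypothetical placeholder:

```python
# Read a dataset whose physical files live in HDFS (path is made up).
df = spark.read.parquet("hdfs:///data/events/2023/10/")

# How many partitions back the in-memory DataFrame abstraction
print(df.rdd.getNumPartitions())

# Repartitioning redistributes those partitions across the executors
df2 = df.repartition(8)
print(df2.rdd.getNumPartitions())  # 8
```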

Data structure

RDD

  • RDD (Resilient Distributed dataset) is the fundamental data structure of Spark, representing an immutable distributed collection of objects.
  • Immutable: once created, an RDD never changes
  • Distributed: RDD is a logical unit from the point of view of the programmer, but internally Spark splits it in smaller partitions that are spread all around the cluster
  • Resilient: an RDD can be reconstructed in the case some parts of it are lost
  • Parallel: RDDs can be processed in parallel

There are 3 types of RDD: the simple RDD; the key/value RDD, where each record is a pair of two values; and the “double” RDD, where each record contains only numbers. The last two have their own functions, such as grouping by key or computing the variance (see the sketch below).
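
A short PySpark sketch of the key/value and “double” flavours, assuming an existing SparkContext sc (e.g. spark.sparkContext); the collect() output order may vary:

```python
# Key/value RDD: each record is a (key, value) pair
kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(kv.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)]
print(kv.groupByKey().mapValues(list).collect())     # [('a', [1, 3]), ('b', [2])]

# "Double" RDD: every record is a number, so numeric helpers are available
nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(nums.mean(), nums.variance())                  # 2.5 1.25
print(nums.stats())                                  # count, mean, stdev, max, min
```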

Creating a RDD

  • from a dataset stored in a file system like HDFS. Number of partitions can be set by the user
  • by parallelizing an existing collection of data through the parallelize function, which splits the data into n partitions
  • by parallel transformation from another RDD
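
A minimal sketch of these three ways of creating an RDD, assuming an existing SparkContext sc; the HDFS path is a hypothetical placeholder:

```python
# 1. From a dataset stored in HDFS, hinting at a minimum number of partitions
lines = sc.textFile("hdfs:///data/logs/app.log", minPartitions=4)

# 2. By parallelizing an in-memory collection into n partitions
nums = sc.parallelize(range(100), numSlices=4)

# 3. By transforming another RDD (the partitioning is carried over)
squares = nums.map(lambda x: x * x)

print(nums.getNumPartitions(), squares.getNumPartitions())  # 4 4
```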

Transformations on RDD

A transformation always produces a new RDD from an existing one. Examples of commonly used functions (note that collect(), first(), and reduce() are technically actions that trigger computation rather than transformations, and cache()/unpersist() control persistence):

  • map(func): apply a function to each element of the RDD
  • filter(func): returns a new RDD after filtering elements
  • flatMap(func): like the function map() but each input item can be mapped to zero or more output items
  • collect(): return all the elements of the dataset as an array. Use with caution as it can overflow memory for large RDDs
  • first(): see first element of the RDD
  • reduce(func): aggregates the elements of the RDD using a function
  • cache(): cache the RDD in memory — unpersist() to remove the RDD from cache
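
A small end-to-end sketch chaining these functions, assuming an existing SparkContext sc:

```python
rdd = sc.parallelize(["to be", "or not", "to be"])

words = rdd.flatMap(lambda line: line.split(" "))   # one record per word
short = words.filter(lambda w: len(w) <= 2)         # drop words longer than 2 chars
upper = short.map(lambda w: w.upper())              # transform each element

upper.cache()                                       # keep the RDD in memory for reuse

print(upper.first())                                # 'TO'
print(upper.collect())                              # ['TO', 'BE', 'OR', 'TO', 'BE']
print(upper.map(len).reduce(lambda a, b: a + b))    # 10 (total number of characters)

upper.unpersist()                                   # remove the RDD from the cache
```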

Dataframes

DataFrames are built on top of RDDs and represent a distributed collection of data organized into named columns. They are like in-memory, partitioned tables.

  • Conceptually equivalent to a table in a database or a data frame in R or Python.
  • Can be created from various sources including structured data files, Hive tables, external databases, or existing RDDs.
  • Allows operations like filtering, aggregation, and more using SQL-like syntax.
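
A minimal DataFrame sketch, assuming the SparkSession spark from the earlier sketch; the rows and column names are made up:

```python
df = spark.createDataFrame(
    [("alice", "fr", 34), ("bob", "jp", 29), ("carol", "fr", 41)],
    ["name", "country", "age"],
)

# SQL-like operations through the DataFrame API
df.filter(df.age > 30).groupBy("country").count().show()

# Or register the DataFrame as a view and query it with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT country, AVG(age) AS avg_age FROM people GROUP BY country").show()
```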

Datasets
Since Spark 2.0, the DataFrame and Dataset APIs are unified as the Structured APIs. There is no Dataset API in PySpark; it is available in Scala.

  • Provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the advantages of DataFrames (optimizations).
  • Allows data to be stored as objects and processed with functional transformations.
  • Offers both compile-time type safety and runtime optimizations.

SparkSQL and logical plan

A module of the Apache Spark ecosystem designed for querying using SQL.

A Logical Plan is an abstract representation of all the transformation steps that need to be executed. So when a SQL query is parsed, the first representation created is the Logical Plan. There are two phases of logical plans in SparkSQL (unresolved and resolved), followed by optimization, physical planning, and code generation:

  1. Unresolved Logical Plan: At this stage, the plan is created from the parsed SQL query, but the table names and column names are not validated against the catalog. Once validation is done, the plan transitions to the “Resolved Logical Plan” phase, where all the identifiers (table names, column names) are resolved to database objects.
  2. Logical Plan Optimization: The Catalyst optimizer applies a series of rule-based transformations on the Resolved Logical Plan to produce an Optimized Logical Plan. Some of these transformations include constant folding, predicate pushdown, and Boolean expression simplification.
  3. Physical Plan: The Optimized Logical Plan is then converted into a Physical Plan using cost-based optimization. This involves generating multiple candidates, evaluating them based on their cost and picking the most optimal one.
  4. Code Generation: Once the physical plan is finalized, Spark generates bytecode to run on each node of the cluster. This bytecode is optimized for execution, ensuring tasks run as quickly as possible.
  5. Visualizing and Understanding Plans: Spark provides tools to visualize both logical and physical plans
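
One such tool is explain(), shown below on the hypothetical DataFrame df from the previous sketch; the extended form prints the parsed (unresolved) and analyzed (resolved) logical plans, the optimized logical plan, and the physical plan:

```python
query = df.filter(df.age > 30).select("name")

# Prints all four plan representations
query.explain(extended=True)

# The default form prints only the physical plan
query.explain()
```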
