Notes from Spark Training, Part 1

Marc Deveaux
10 min read · Oct 11, 2023


Teaching done by Moise P; content comes from the slides

Big data 5V

  • Volume: refers to huge amount of data (petabytes and exabytes)
  • Velocity: captures the speed at which new data is generated, processed, and made available
  • Variety: structured, semi-structured, and unstructured types of data
  • Veracity: trustworthiness of the data
  • Value: refers to our ability to turn our data into value

Hadoop

  • Open-source, Java-based framework used to efficiently store and process large datasets. It runs on clusters of computers
  • It is a distributed system, meaning it exploits parallel computation and data locality
  • Data is split into blocks that are distributed across a cluster of many servers
  • Every server processes only the data stored in its local storage
  • You combine results from local data processing to produce the final output

Vocabulary

  • A cluster is a group of computers working together: it provides data storage, data processing, and resource management
  • A node is an individual computer in the cluster. The Master nodes manage distribution of work and data to worker nodes
  • A daemon is a program running on a node: each Hadoop daemon performs a specific function in the cluster

Hadoop Architecture

  • HDFS: a distributed file system that provides high-throughput access to application data
  • Ozone: “redundant, distributed object store optimized for Big data workloads. The primary design point of ozone is scalability, and it aims to scale to billions of objects.” (Apache Ozone website)
  • YARN: a general-purpose job scheduler and resource manager
  • MapReduce: a YARN-based system for distributed batch processing

The Hadoop architecture is flexible: some parts can easily be replaced with third-party components (e.g., Ozone instead of HDFS), and additional software packages such as Impala or Flume can be installed on top of it.

HDFS

  • Used to store massive volumes of data. It prefers millions of 100+ MB files rather than billions of 10 KB files
  • Runs on clusters of many commodity servers
  • “write once, read many times” principle (WORM): data storage device in which information, once written, cannot be modified.
  • Reading the complete data as fast as possible is more important than fetching a single record
  • The goal of HDFS is to collect very large files and deliver them to a large number of clients while tolerating failures coming from using commodity hardware

HDFS is bad at:

  • handling a lot of small files: using HDFS to process many small files is a poor choice
  • supporting fast random writes (HDFS is optimized for append-only operations)
  • providing low-latency data access; HDFS is optimized for high throughput

Concept:

  • Files are split into fixed size chunks of B MB (B = 128 by default) called blocks. A block is the minimum amount of data readable/writable by
    HDFS
  • Blocks are stored and replicated onto a cluster of DataNodes
  • A NameNode stores metadata (keeps track of how files are broken down into blocks and where they are stored)
  • A HDFS Client talks to the NameNode for metadata related activities and to DataNodes for reading and writing files

NameNode

  • It stores metadata information (e.g., file system tree) and the mapping of file blocks to DataNodes. Any change to the file system namespace or
    its properties is recorded by the NameNode
  • It does NOT directly write or read data
  • To perform an I/O operation, a client must first talk to the NameNode to know the block locations, to update file system namespace, …
  • It is the most critical component of HDFS: The NameNode daemon must be running at all times and if the NameNode stops, the HDFS cluster becomes inaccessible

DataNode

  • It stores and retrieves file blocks when it is told to (by clients or the NameNode)
  • It sends heartbeat messages to the NameNode periodically (every 3 seconds by default); every message contains the DataNode’s identity and the list of blocks it is storing. The NameNode uses this information to confirm that the DataNode is alive
  • DataNodes hold several replicas of the different blocks (in case one fails)

Yet Another Resource Negotiator (YARN)

  • YARN is Hadoop’s distributed cluster resource manager and job scheduler
  • The core idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons
  • YARN allows multiple data processing engines to run on a single Hadoop cluster. E.g., batch programs (e.g. Spark, MapReduce), interactive SQL (e.g. Impala), advanced analytics (e.g. Spark, Impala), streaming (e.g. Spark Streaming)
  • YARN clusters can expand to thousands of nodes managing petabytes of data
  • There is a Hadoop YARN Web UI, which is the entry point for inspecting currently running Spark processes

Core components

  • Container: basic unit of work in YARN; it allocates resources (RAM, CPU, disk, …) of a single node to execute an application-specific process
  • Resource Manager: “the ultimate authority that arbitrates resources among all the applications in the system” (Apache Hadoop website)
  • NodeManager: “responsible for launching and managing containers on a node. Containers execute tasks as specified by the AppMaster.”
    (Apache Hadoop website)
  • ApplicationMaster: “process that coordinates the execution of an application in the cluster. Each application has its own unique Application Master that is tasked with negotiating resources (Containers) from the Resource Manager and working with the Node Managers to execute and monitor the tasks.” (Cloudera documentation)
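
As a rough illustration of how these pieces relate to a Spark job, the container sizes that the ApplicationMaster negotiates with the ResourceManager are usually expressed as configuration when the application starts. Below is a minimal PySpark sketch, assuming a cluster where YARN is the configured cluster manager; the instance, memory, and core values are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical sizing: each executor container asks YARN for 4 GB of RAM and
# 2 cores. The ApplicationMaster negotiates these containers with the
# ResourceManager, and the NodeManagers launch them on worker nodes.
spark = (
    SparkSession.builder
    .appName("yarn-sizing-sketch")
    .master("yarn")                           # assumes YARN is reachable
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```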

MapReduce

  • Programming model and a batch-based, distributed computing framework for processing large data sets on large clusters in a reliable way
  • Inspired by (but not identical to) the functional operations map and reduce. Map applies a function to individual values to create a new list of values (mapping square over 1 2 3 4 gives 1 4 9 16). Reduce combines input values to create a single value (reducing + over 1 4 9 16 gives 30); see the short sketch after this list
  • MapReduce simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as Input partitioning, Computational parallelization, Work distribution or Dealing with unreliable software / hardware
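
To make the map/reduce analogy above concrete, here is a plain-Python sketch of the two functional operations the framework is named after (this is not Hadoop MapReduce itself, just the idea behind it):

```python
from functools import reduce
import operator

values = [1, 2, 3, 4]

# "map square 1 2 3 4" -> 1 4 9 16
squares = list(map(lambda x: x * x, values))

# "reduce + 1 4 9 16" -> 30
total = reduce(operator.add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```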

Hive

  • A data warehouse platform built on top of Hadoop that lets you read, write, and manage large datasets residing in distributed storage using an SQL-like language called HiveQL
  • HiveQL statements can be executed via Spark or MapReduce (see the sketch after this list)
  • Uses schema-on-read approach: “ Schema on read differs from schema on write because you create the schema only when reading the data”
    (Dell website)
  • Data can be stored in HDFS
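
As a small sketch of how Hive tables can be queried from Spark, the snippet below assumes a reachable Hive metastore; the database and table names (sales.orders) are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore,
# so HiveQL-style statements can be executed by Spark.
spark = (
    SparkSession.builder
    .appName("hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.sql(
    "SELECT customer_id, COUNT(*) AS n_orders "
    "FROM sales.orders GROUP BY customer_id"
)
df.show(5)
```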

Spark

Background

  • MapReduce has several limitations (many problems aren’t easily described as map-reduce) and it can be hard to program in directly. For example, MapReduce has no native support for applications that must reuse a working set of data across multiple parallel computations. This means that when a function is repeatedly applied to the same dataset (as in a machine learning algorithm when you optimize a parameter), each MapReduce job must reload the data from persistent storage (e.g., HDFS), incurring a significant performance penalty
  • Apache Spark was developed in response to the limitations of MapReduce and is a distributed computing system designed for big data processing and analytics. Known for its speed and ease of use, Spark can run programs up to 100x faster than Hadoop MapReduce in memory
  • Unlike MapReduce, which supports only a map stage and a reduce stage, Spark supports a DAG (directed acyclic graph) with many stages, such as Stage 1 (parallelize, filter, map); Stage 2 (reduceByKey, map); Stage 3 … (see the sketch after this list)
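
A minimal PySpark sketch of such a multi-stage DAG, assuming a local SparkSession (the word list is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = ["spark", "hadoop", "spark", "yarn", "hdfs", "spark"]

# Stage 1: parallelize, filter, map
pairs = (sc.parallelize(words)
           .filter(lambda w: len(w) > 4)     # keep "spark" and "hadoop"
           .map(lambda w: (w, 1)))

# Stage 2: reduceByKey triggers a shuffle, i.e. a new stage in the DAG
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1)]
```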

Advantages

  • Unified Platform: Spark integrates batch processing, real-time processing, machine learning, graph processing, and interactive query analysis.
  • In-Memory Computation: it has the ability to process data in-memory, reducing the need for time-consuming disk I/O operations
  • Compute Cluster: Spark can run on a distributed cluster of machines, spreading data and computations to process data in parallel
  • Fault Tolerance: Despite its in-memory nature, Spark ensures data reliability and fault tolerance through lineage information. Even if a task fails, Spark can recompute lost data using the original transformations, ensuring that no data is permanently lost.
  • Language Flexibility: Spark supports multiple programming languages (R, Python, Scala, …)
  • Integration: Spark can easily integrate with popular big data tools and frameworks (HDFS, Cassandra, HBase, and Amazon S3 for data storage,
    as well as Hadoop’s YARN and Kubernetes for cluster management)

Spark Components

  • Spark Core: The foundational engine for data distribution and I/O capabilities. It provides essential functionalities like task scheduling, memory management, and fault recovery.
  • Spark SQL: Enables querying data via SQL as well as a DataFrame API. It supports various data sources, including Hive, Avro, Parquet, ORC,
    JSON, JDBC, and more.
  • Spark Streaming: Allows for real-time data processing by ingesting data in mini-batches
  • Spark MLlib: Machine learning library
  • Spark GraphX: For graph computation and graph-parallel computation
  • Cluster Manager: Spark supports standalone native clustering, Apache Mesos, and Hadoop YARN as cluster managers

Architecture

  • Driver Program: Central control node that runs the main function (your application code) and creates the SparkContext. It is responsible for executing the application on the cluster and coordinating tasks
  • Executor: A distributed agent responsible for executing tasks. Each worker node in the cluster can run multiple executors
  • Task: A unit of work sent by the driver to the executor
  • Cluster Manager: An external service responsible for acquiring resources on the cluster. As mentioned, Spark supports standalone, Mesos, and YARN.

SparkContext is the entry point to all of the components of Apache Spark. A Spark application is initiated as soon as the SparkContext is created.

The distributed components in a Spark cluster (Executor and Cluster Manager) are accessed by the Driver through a SparkSession: it is an entry point to the underlying Spark functionality used to programmatically create Spark RDDs, DataFrames, and Datasets. “The SparkContext is a singleton and can only be created once in a Spark application. The SparkSession, on the other hand, can be created multiple times within an application” (Spark code hub)
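
A minimal sketch of this entry point in PySpark, run locally for illustration (replace the master URL with "yarn" or another cluster manager on a real cluster):

```python
from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already active, which
# is why a SparkSession can be "created" many times while the underlying
# SparkContext remains a singleton.
spark = (
    SparkSession.builder
    .appName("architecture-sketch")
    .master("local[*]")
    .getOrCreate()
)

sc = spark.sparkContext        # the single SparkContext behind the session
print(sc.applicationId)        # identifier of the running application
print(sc.master)               # which master / cluster manager is used
```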

Deployment mode

Storage and partitioning

  • Actual physical data is distributed across storage as partitions residing in either HDFS or cloud storage. While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction — as a DataFrame in memory
  • Each Spark executor is preferably allocated a task that requires it to read the partition closest to it in the network, observing data locality (illustrated in the sketch after this list)
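
A small sketch of inspecting that partitioning from PySpark, assuming the SparkSession spark from the earlier sketch; the HDFS path is a hypothetical placeholder:

```python
# Read a dataset whose physical files live in HDFS (path is made up).
df = spark.read.parquet("hdfs:///data/events/2023/10/")

# How many partitions back the in-memory DataFrame abstraction
print(df.rdd.getNumPartitions())

# Repartitioning redistributes those partitions across the executors
df2 = df.repartition(8)
print(df2.rdd.getNumPartitions())  # 8
```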

Data structure

RDD

  • RDD (Resilient Distributed dataset) is the fundamental data structure of Spark, representing an immutable distributed collection of objects.
  • Immutable: once created, an RDD never changes
  • Distributed: RDD is a logical unit from the point of view of the programmer, but internally Spark splits it in smaller partitions that are spread all around the cluster
  • Resilient: an RDD can be reconstructed in the case some parts of it are lost
  • Parallel: RDDs can be processed in parallel

There are 3 types of RDD: the simple RDD; the key/value RDD, where each record is a pair of two values; and the “double” RDD, where each record contains only numbers. The last two have their own functions, such as grouping by key or computing the variance (see the sketch below).
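
A short PySpark sketch of the key/value and “double” flavours, assuming an existing SparkContext sc (e.g. spark.sparkContext); the collect() output order may vary:

```python
# Key/value RDD: each record is a (key, value) pair
kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(kv.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)]
print(kv.groupByKey().mapValues(list).collect())     # [('a', [1, 3]), ('b', [2])]

# "Double" RDD: every record is a number, so numeric helpers are available
nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(nums.mean(), nums.variance())                  # 2.5 1.25
print(nums.stats())                                  # count, mean, stdev, max, min
```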

Creating a RDD

  • from a dataset stored in a file system like HDFS. Number of partitions can be set by the user
  • by parallelizing an existing collection of data through the parallelize function, which splits the data into n partitions
  • by parallel transformation from another RDD
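
A minimal sketch of these three ways of creating an RDD, assuming an existing SparkContext sc; the HDFS path is a hypothetical placeholder:

```python
# 1. From a dataset stored in HDFS, hinting at a minimum number of partitions
lines = sc.textFile("hdfs:///data/logs/app.log", minPartitions=4)

# 2. By parallelizing an in-memory collection into n partitions
nums = sc.parallelize(range(100), numSlices=4)

# 3. By transforming another RDD (the partitioning is carried over)
squares = nums.map(lambda x: x * x)

print(nums.getNumPartitions(), squares.getNumPartitions())  # 4 4
```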

Transformations on RDD

A transformation always produces a new RDD from an existing one. Examples of commonly used functions (note that collect(), first(), and reduce() are technically actions that trigger computation rather than transformations, and cache()/unpersist() control persistence):

  • map(func): apply a function to each element of the RDD
  • filter(func): returns a new RDD after filtering elements
  • flatMap(func): like the function map() but each input item can be mapped to zero or more output items
  • collect(): return all the elements of the dataset as an array. Use with caution as it can overflow memory for large RDDs
  • first(): see first element of the RDD
  • reduce(func): aggregates the elements of the RDD using a function
  • cache(): cache the RDD in memory — unpersist() to remove the RDD from cache
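
A small end-to-end sketch chaining these functions, assuming an existing SparkContext sc:

```python
rdd = sc.parallelize(["to be", "or not", "to be"])

words = rdd.flatMap(lambda line: line.split(" "))   # one record per word
short = words.filter(lambda w: len(w) <= 2)         # drop words longer than 2 chars
upper = short.map(lambda w: w.upper())              # transform each element

upper.cache()                                       # keep the RDD in memory for reuse

print(upper.first())                                # 'TO'
print(upper.collect())                              # ['TO', 'BE', 'OR', 'TO', 'BE']
print(upper.map(len).reduce(lambda a, b: a + b))    # 10 (total number of characters)

upper.unpersist()                                   # remove the RDD from the cache
```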

Dataframes

DataFrames are built on top of RDDs and represent a distributed collection of data organized into named columns. They are like in-memory, partitioned tables.

  • Conceptually equivalent to a table in a database or a data frame in R or Python.
  • Can be created from various sources including structured data files, Hive tables, external databases, or existing RDDs.
  • Allows operations like filtering, aggregation, and more using SQL-like syntax.
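
A minimal DataFrame sketch, assuming the SparkSession spark from the earlier sketch; the rows and column names are made up:

```python
df = spark.createDataFrame(
    [("alice", "fr", 34), ("bob", "jp", 29), ("carol", "fr", 41)],
    ["name", "country", "age"],
)

# SQL-like operations through the DataFrame API
df.filter(df.age > 30).groupBy("country").count().show()

# Or register the DataFrame as a view and query it with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT country, AVG(age) AS avg_age FROM people GROUP BY country").show()
```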

Datasets
Since Spark 2.0, the DataFrame and Dataset APIs are unified as the Structured APIs. There is no Dataset API in PySpark; it is available in Scala.

  • Provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the advantages of DataFrames (optimizations).
  • Allows data to be stored as objects and processed with functional transformations.
  • Offers both compile-time type safety and runtime optimizations.

SparkSQL and logical plan

A module of the Apache Spark ecosystem designed for querying using SQL.

A Logical Plan is an abstract representation of all the transformation steps that need to be executed. So when a SQL query is parsed, the first representation created is the Logical Plan. There are two phases of logical plans in SparkSQL (unresolved and resolved), followed by optimization, physical planning, and code generation:

  1. Unresolved Logical Plan: At this stage, the plan is created from the parsed SQL query, but the table names and column names are not validated against the catalog. Once validation is done, the plan transitions to the “Resolved Logical Plan” phase, where all the identifiers (table names, column names) are resolved to database objects.
  2. Logical Plan Optimization: The Catalyst optimizer applies a series of rule-based transformations on the Resolved Logical Plan to produce an Optimized Logical Plan. Some of these transformations include constant folding, predicate pushdown, and Boolean expression simplification.
  3. Physical Plan: The Optimized Logical Plan is then converted into a Physical Plan using cost-based optimization. This involves generating multiple candidates, evaluating them based on their cost and picking the most optimal one.
  4. Code Generation: Once the physical plan is finalized, Spark generates bytecode to run on each node of the cluster. This bytecode is optimized for execution, ensuring tasks run as quickly as possible.
  5. Visualizing and Understanding Plans: Spark provides tools to visualize both logical and physical plans
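
One such tool is explain(), shown below on the hypothetical DataFrame df from the previous sketch; the extended form prints the parsed (unresolved) and analyzed (resolved) logical plans, the optimized logical plan, and the physical plan:

```python
query = df.filter(df.age > 30).select("name")

# Prints all four plan representations
query.explain(extended=True)

# The default form prints only the physical plan
query.explain()
```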
