Apache Spark is a distributed computing platform that processes "Big Data" efficiently. Spark supports in-memory processing, which can make it up to 10x faster than Hadoop MapReduce.

Key Features:

  • Fast Processing
  • Supports both Batch and Real-time processing
  • Flexible
  • Better Analytics
  • Compatible with Hadoop

How does Spark execute our programs on a cluster?

Driver: The driver is the process where the main method runs. It is responsible for analysing, distributing, scheduling, and monitoring the tasks divided among the executors. The driver also keeps all the application's information during its lifetime.

Executor: Executors are worker-node processes. They are responsible only for executing the tasks assigned to them and reporting their status/output back to the driver.

How does it really work?

Apache Spark Architecture

Step 1: The driver program calls the main application and launches the Spark Context.

Step 2: The driver then translates the user-submitted code into units of work: jobs, which are broken down into stages and tasks.

Step 3: The Spark Context communicates with the cluster manager to allocate resources and distribute the tasks across multiple executors.

Spark context can work with many cluster managers like:

  • YARN (Yet Another Resource Negotiator)
  • Standalone Cluster Manager
  • Mesos
  • Kubernetes
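
The choice of cluster manager is expressed through the `--master` option of spark-submit; the host names, ports, and `app.py` below are placeholders:

```shell
# Each line shows one supported cluster manager; pick exactly one.
spark-submit --master yarn                    app.py   # YARN
spark-submit --master spark://host:7077       app.py   # Standalone
spark-submit --master mesos://host:5050       app.py   # Mesos
spark-submit --master k8s://https://host:6443 app.py   # Kubernetes
```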

Step 4: After completing their allocated tasks, the executors communicate directly with the driver. The lifetime of the executors is the same as that of the Spark application.

Let's see how we execute programs on a Spark cluster.

Spark supports a "Master-Slave" architecture, like Hadoop MapReduce. There are two ways to execute programs, as mentioned below:

  • Interactive client (the PySpark or Scala shell, notebooks)
  • Job submission (the spark-submit utility)
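
A sketch of the two entry points (the `--master` value and the file name are placeholders):

```shell
# 1) Interactive client: opens a shell with a SparkContext ready to use
pyspark --master yarn

# 2) Job submission: packages the application and hands it to the cluster
spark-submit --master yarn my_job.py
```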

MODES OF DEPLOYMENT OF A SPARK APPLICATION:

CLIENT MODE

In client mode, the driver runs locally on the client machine and the executors run on the cluster. It is mainly used for debugging and in the learning/development stage, as it prints output directly to the terminal.

How does the process flow?

Step 1: On the client machine, the user runs Spark Shell or PySpark to interact with the Spark cluster, which launches the Spark Context.

Step 2: Next, a request goes from the local machine (the driver) to the Resource Manager to create a YARN application.

Step 3: The YARN Resource Manager creates an Application Master which acts as an Executor Launcher.

Step 4: The Application Master (AM) reaches out to the Resource Manager for more containers.

Client Mode Architecture

Step 5: The Resource Manager allocates new containers, and the Application Master starts an executor in each container.

Step 6: After the initial setup, the executors communicate directly with the driver and are destroyed when the application completes.
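
A client-mode submission on YARN might look like the following (`my_job.py` and the executor count are placeholders; client mode is also the default deploy mode):

```shell
# Driver runs on this machine; executors run in YARN containers.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  my_job.py
```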

CLUSTER MODE

Step 1: The client submits the packaged application to the cluster using the spark-submit tool.

Step 2: The spark-submit utility sends a YARN application request to the YARN resource manager.

Step 3: The Resource Manager starts an Application Master, and the driver starts inside the AM container. There is no dependency on the client machine.

Step 4: The driver asks the Resource Manager for more containers.

Step 5: The Resource Manager will allocate new containers.

Cluster Mode Architecture

Step 6: The driver starts an executor in each container.

Step 7: The executors perform their assigned tasks and return the output to the driver in the AM container.
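
The whole flow above is triggered by a single command like this (the file name and executor count are placeholders):

```shell
# Driver runs inside the Application Master container on the cluster;
# the client machine is free to disconnect after submission.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  my_job.py
```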

LOCAL MODE

When you don't have the infrastructure to create a multi-node cluster but want to set up a Spark environment for learning, you can use local mode. It needs only a single JVM (Java Virtual Machine), in which both the driver and the executors run.

Local Mode Architecture

That's all I have, and thanks a lot for reading. Please let us know of any corrections or suggestions, and do share and comment if you liked the post. Thanks in advance!

Thanks to Gargi Gupta for helping us grow day by day. She is an expert in Big Data and loves competitive programming.

