Apache Spark is a distributed computing platform that processes "Big Data" efficiently. Spark supports in-memory processing, which can make it up to 10x faster than Hadoop MapReduce. Its key advantages are:
- Fast Processing
- Supports both Batch and Real-time processing
- Better Analytics
- Compatible with Hadoop
How does Spark execute our programs on a cluster?
Driver: The driver is the process where the application's main method runs. The driver is wholly responsible for analysing, distributing, scheduling and monitoring the tasks divided among the executors. It also keeps all the application's information during its lifetime.
Executor: Executors are the worker-node processes. They are responsible only for executing the tasks/code assigned to them, and they report the status/output back to the driver node.
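The division of labour between the driver and the executors can be sketched with a toy analogy in plain Python (this is not Spark itself; `run_task` and the partition list are made up for illustration, with threads standing in for executors):

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # "Executor"-side work: process one partition of the data.
    return sum(partition)

# "Driver"-side logic: split the data, schedule tasks on workers,
# and collect the results. (Real Spark executors are separate JVM
# processes on worker nodes, not threads in the driver.)
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_task, partitions))

print(results)       # [6, 15, 24]
print(sum(results))  # 45
```

The key point the analogy captures: the driver decides how the work is split and scheduled, while each worker only executes the task it is handed and reports its result back.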
How does it really work?
Step 1: The driver program in the Spark architecture invokes the main method of the application and launches the SparkContext.
Step 2: The driver then translates the user-submitted code into jobs and breaks them down into tasks.
Step 3: The SparkContext communicates with the cluster manager to allocate resources and distribute the tasks across multiple executors.
The SparkContext can work with many cluster managers, such as:
- YARN (Yet Another Resource Negotiator)
- Standalone Cluster Manager
Step 4: After completing their allocated tasks, the executors communicate directly with the driver. The lifetime of the executors is the same as that of the Spark application.
Let's see how we execute programs on a Spark cluster.
Spark supports a master-slave architecture, like Hadoop MapReduce. There are two ways to execute programs, as mentioned below:
- Interactive client (PySpark shell, Scala shell, notebooks)
- Submitting a job (spark-submit utility)
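For example, the two ways are typically invoked like this (a sketch assuming Spark's bin/ directory is on the PATH; the application file name is hypothetical):

```shell
# Interactive clients: a shell starts the driver locally and gives
# you a ready-made SparkSession/SparkContext to experiment with.
pyspark        # Python shell
spark-shell    # Scala shell

# Submitting a job: package the application and hand it to the cluster.
spark-submit --master yarn my_app.py
```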
MODES OF DEPLOYMENT OF A SPARK APPLICATION:
In Client Mode, the driver runs locally on the client machine while the executors run on the cluster. It is mainly used for debugging and for the learning/development stage, as it prints the output directly on the terminal.
How does the process flow?
Step 1: On the client machine, the client uses the Spark shell/PySpark to interact with the Spark cluster, which launches the SparkContext.
Step 2: Next, a request goes from the local machine (the driver) to the YARN Resource Manager to create a YARN application.
Step 3: The YARN Resource Manager creates an Application Master (AM), which acts as an executor launcher.
Step 4: The AM reaches out to the Resource Manager for more containers.
Step 5: The Resource Manager allocates new containers, and the Application Master starts an executor in each container.
Step 6: After the initial setup, the executors directly communicate with the driver and get destroyed after task completion.
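The client-mode flow above corresponds to a submission like the following (a sketch; the application file name is hypothetical):

```shell
# Client mode: the driver runs on this machine,
# while the executors run in YARN containers on the cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  my_app.py
```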
In Cluster Mode, the driver itself runs on the cluster, inside the Application Master container.
Step 1: The client submits the packaged application to the cluster using the spark-submit tool.
Step 2: The spark-submit utility sends a YARN application request to the YARN resource manager.
Step 3: The Resource Manager starts an Application Master. A driver then starts inside the AM container, so there is no dependency on the client machine.
Step 4: The driver asks the Resource Manager for more containers.
Step 5: The Resource Manager allocates new containers.
Step 6: The driver starts an executor in each container.
Step 7: The executors perform their assigned tasks and return the output to the driver in the AM container.
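A cluster-mode submission differs only in the deploy mode flag (again a sketch; the application file name is hypothetical):

```shell
# Cluster mode: the driver runs inside the Application Master
# container on the cluster; the client machine can disconnect
# after the submission is accepted.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
```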
When you don't have the infrastructure to create a multi-node cluster but want to set up a Spark environment for learning, you can use Local Mode. It needs only a single JVM (Java Virtual Machine), in which both the driver and the executors run.
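A minimal local-mode session can be configured like this (a configuration sketch assuming pyspark is installed; the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# "local[*]" runs the driver and executors inside one JVM,
# using as many worker threads as there are CPU cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("learning-spark") \
    .getOrCreate()

# ... run your jobs here ...
spark.stop()
```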
That's all I have, and thanks a lot for reading. Please let us know of any corrections/suggestions, and please do share and comment if you liked the post. Thanks in advance…
Thanks to Gargi Gupta for helping us grow day by day. She is an expert in Big Data and loves competitive programming.