In the context of Apache Spark, worker nodes and executors are not the same; they are different components of a Spark cluster.
- Worker Nodes:
- Worker nodes are also referred to as slave nodes in older Spark documentation.
- These nodes are responsible for running the tasks assigned by the Spark driver program.
- Worker nodes manage the resources (CPU, memory, etc.) and execute the tasks on behalf of the driver.
- They can be part of a cluster managed by a cluster manager such as Apache Mesos, Hadoop YARN, or Spark’s built-in standalone cluster manager.
- Executors:
- Executors are processes that run on worker nodes; each worker node can host one or more executors.
- Executors are responsible for executing the tasks within a Spark application. They run in separate JVMs.
- Executors are created and managed by the Spark cluster manager (e.g., YARN, Mesos, or Spark’s standalone cluster manager) and are allocated resources from the worker nodes.
- Multiple executors can run on a single worker node, and they are used to parallelize the processing of tasks within a Spark application (executor count and sizing are typically configured at submission time, as in the sketch just below).
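As a rough illustration, executor count and sizing map onto standard Spark configuration properties (spark.executor.instances, spark.executor.cores, spark.executor.memory). The sketch below sets them programmatically for readability; in practice the same values are usually passed to spark-submit, and spark.executor.instances is only honored by cluster managers that support it (e.g. YARN). The numbers are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

object ExecutorConfigSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sizing: 4 executors, each with 2 cores and 4 GB of memory.
    // Whether these take effect when set here depends on the cluster manager
    // and deploy mode; they are shown inline purely for illustration.
    val spark = SparkSession.builder()
      .appName("executor-config-sketch")
      .config("spark.executor.instances", "4")
      .config("spark.executor.cores", "2")
      .config("spark.executor.memory", "4g")
      .getOrCreate()

    spark.stop()
  }
}
```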
Working Process:
- Let’s say a user submits a job using “spark-submit”.
- “spark-submit” will in turn launch the Driver, which executes the main() method of our code (a minimal driver skeleton illustrating this flow is sketched after this list).
- The Driver contacts the cluster manager and requests resources to launch the Executors.
- The cluster manager launches the Executors on behalf of the Driver.
- Once the Executors are launched, they establish a direct connection with the Driver.
- The Driver creates the Logical Plan and converts it into a Physical Plan.
- The Driver then checks the Lineage (the DAG of transformations) to split the job into Stages and determine the total number of Tasks (see the plan-inspection sketch after this list).
- Once the Physical Plan is generated, Spark allocates the Tasks to the Executors.
- Tasks run on the Executors, and each Task returns its result to the Driver upon completion.
- When all Tasks are completed, the main() method running in the Driver finishes and invokes sparkContext.stop().
- Finally, Spark releases all the resources back to the Cluster Manager.
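To make this flow concrete, here is a minimal, hypothetical driver program in Scala. spark-submit launches it, its main() method creates the SparkSession (the Driver), an action such as count() triggers planning and Task scheduling on the Executors, the result comes back to the Driver, and stop() releases the resources. All names and numbers are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object WorkingProcessSketch {
  def main(args: Array[String]): Unit = {
    // Launched via spark-submit; this JVM process is the Driver.
    val spark = SparkSession.builder()
      .appName("working-process-sketch")
      .getOrCreate()

    // Transformations only build up the lineage; nothing executes yet.
    val numbers = spark.sparkContext.parallelize(1 to 1000000)
    val evens   = numbers.filter(_ % 2 == 0)

    // The action makes the Driver build the plan, split the job into Tasks,
    // and schedule them on the Executors; each Task's result returns here.
    val count = evens.count()
    println(s"Even numbers: $count")

    // Stopping the session releases the Executors and returns their
    // resources to the cluster manager.
    spark.stop()
  }
}
```

Such a program would typically be packaged into a jar and launched with spark-submit, which is also where the executor sizing settings from the earlier sketch are usually supplied.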
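The lineage and plans the Driver works from can also be inspected directly: toDebugString prints an RDD's lineage, and explain(true) prints both the logical and physical plans of a DataFrame. A quick sketch, assuming it is pasted into spark-shell where a SparkSession named spark already exists:

```scala
// RDD lineage: toDebugString shows the chain of transformations the Driver
// uses to break the job into stages and tasks.
val rdd = spark.sparkContext.parallelize(1 to 100).map(_ * 2).filter(_ > 50)
println(rdd.toDebugString)

// DataFrame plans: explain(true) prints the logical and physical plans from
// which Tasks are eventually generated.
val df = spark.range(1000).selectExpr("id % 10 AS key", "id AS value")
df.groupBy("key").count().explain(true)
```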