High-level Architecture of Spark
A Spark application involves five key entities: a driver program, a cluster manager, workers, executors, and tasks.
Workers: A worker provides CPU, memory, and storage resources to a Spark application. The workers run a Spark application as distributed processes on a cluster of nodes.
Cluster Managers: Spark uses a cluster manager to acquire cluster resources for executing a job. A cluster manager manages computing resources across a cluster of worker nodes. It provides low-level scheduling of cluster resources across applications. It enables multiple applications to share cluster resources and run on the same worker nodes.
Spark currently supports three cluster managers: standalone, Mesos, and YARN. Mesos and YARN allow Spark and Hadoop applications to run simultaneously on the same worker nodes.
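The cluster manager is selected with the `--master` option of `spark-submit`. A brief sketch of the master URL for each of the three managers (host names, ports, and the application jar are placeholders):

```shell
# Standalone cluster manager (host and port are placeholders; 7077 is the default port)
spark-submit --master spark://master-host:7077 app.jar

# Apache Mesos (host and port are placeholders; 5050 is the default port)
spark-submit --master mesos://mesos-host:5050 app.jar

# Hadoop YARN (the resource manager address comes from the Hadoop configuration)
spark-submit --master yarn app.jar
```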
Driver Programs: A driver program is an application that uses Spark as a library. It provides the data processing code that Spark executes on the worker nodes. A driver program can launch one or more jobs on a Spark cluster.
Executors: An executor is a JVM (Java virtual machine) process that Spark creates on each worker for an application. It executes application code concurrently in multiple threads. It can also cache data in memory or on disk. An executor has the same lifespan as the application for which it is created. When a Spark application terminates, all the executors created for it also terminate.
Tasks: A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to either return a result to the driver program or partition its output for a shuffle. Spark creates one task per data partition. An executor runs one or more tasks concurrently, so the degree of parallelism is bounded by the number of partitions.
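The one-task-per-partition execution model can be illustrated without Spark itself. The sketch below is plain Python, not Spark's API: it mimics an executor running one task per data partition in a thread pool, with the driver combining the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# A "dataset" split into partitions; Spark would create one task per partition.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def task(partition):
    # Each task computes a partial result over its own partition.
    return sum(partition)

# The "executor": a pool of threads, each running one task at a time.
# With 3 partitions and 2 threads, at most 2 tasks run concurrently.
with ThreadPoolExecutor(max_workers=2) as executor:
    partial_results = list(executor.map(task, partitions))

# The "driver" combines the per-partition results into the final answer.
result = sum(partial_results)
print(partial_results, result)  # [6, 9, 30] 45
```

Note how the number of partitions (three), not the number of threads, fixes the total number of tasks; adding more threads beyond the partition count would not increase parallelism.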