Apache Spark Tutorial


Apache Spark Architecture

Introduction

In this section we will look at the Apache Spark architecture in detail, and also try to understand how it works internally. We will also understand some of the main technical terms associated with Spark’s architecture like Driver, Executor, Master, Cluster and Worker.

Apache Spark Architectural Concepts, Key Terms, and Keywords 

Now that we have a fair understanding of Spark and its main features, let us dive deeper into its architecture and the anatomy of a Spark application. Spark is a distributed cluster-computing framework that works in a master-slave fashion. To execute a Spark program, we run the spark-submit script. We will go over the details in later sections; for now, think of spark-submit as the equivalent of invoking a program's main method in Java. Running spark-submit against a cluster launches one master process (the Driver) and one or more slave processes (the Executors) to carry out the work defined in the Spark program. A Spark program can be launched in several modes, such as standalone, client, and cluster mode. We will look at these options in detail later.

Spark Cluster

To visualize the architecture of a Spark cluster, let us look at the diagram below and understand each component and its function.

Whenever we want to run an application, we perform a spark-submit with some parameters. Say we submit Application A. This creates one Driver process for A, which acts as the master for that application, and one or more Executors on the Worker nodes. This set of a Driver and Executors is exclusive to Application A. If we now submit another application, B, a second Driver and its own Executors are started, entirely independent of those belonging to Application A. Even if both Drivers run on the same machine in the cluster, they remain isolated from each other, and the same applies to the Executors. So a Spark cluster consists of a Master node and Worker nodes that can be shared across multiple applications, but each application runs in isolation from the others.

When we run a Spark application on a cluster managed by a resource manager such as YARN, there are two ways to launch it: cluster mode and client mode. In cluster mode, YARN creates and manages an Application Master in which the Driver runs, so the client can go away once the application has started. In client mode, the Driver keeps running on the client, and the Application Master is used only to request resources from YARN.

To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

To launch in client mode:  

$ ./bin/spark-shell --master yarn --deploy-mode client
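
The two commands above differ only in the value passed to --deploy-mode. As a rough sketch of that idea, the plain Scala helper below assembles the spark-submit argument list for either mode. The object name `SubmitCommandSketch`, the function `yarnSubmit`, and the jar name `app.jar` are hypothetical and used only for illustration; the flags are the ones shown in the commands above.

```scala
object SubmitCommandSketch {
  // Hypothetical helper (not part of Spark): builds the argument list for
  // submitting an application to YARN in the given deploy mode.
  def yarnSubmit(
      mainClass: String,
      appJar: String,
      deployMode: String,
      appArgs: Seq[String] = Nil
  ): Seq[String] = {
    // Only the two deploy modes discussed in the text are accepted.
    require(
      deployMode == "cluster" || deployMode == "client",
      s"unsupported deploy mode: $deployMode"
    )
    Seq(
      "./bin/spark-submit",
      "--class", mainClass,
      "--master", "yarn",
      "--deploy-mode", deployMode,
      appJar
    ) ++ appArgs
  }
}
```

For example, SubmitCommandSketch.yarnSubmit("path.to.your.Class", "app.jar", "cluster") yields the same arguments as the cluster-mode command above, with the placeholder jar filled in.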


Spark Master

When we run Spark in Standalone mode, a Master node must be started first, which can be done by executing:

./sbin/start-master.sh 

This creates a Master node on the machine where the command is executed. Once the Master node starts, it prints a Spark URL of the form spark://HOST:PORT, which can be used to start the Worker nodes.
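
Because the Worker nodes are started with this URL, it can be useful to check its shape before passing it along. The following is a minimal sketch in plain Scala, assuming only the spark://HOST:PORT form described above. `MasterUrl` and `parse` are hypothetical names and are not part of Spark.

```scala
// Hypothetical helper (not part of Spark): splits a standalone master URL
// of the form spark://HOST:PORT into its host and port.
final case class MasterUrl(host: String, port: Int)

object MasterUrl {
  // The host may not contain ':', '/' or whitespace; the port is 1 to 5 digits.
  private val Pattern = """spark://([^:/\s]+):(\d{1,5})""".r

  // Returns None when the string does not have the expected shape.
  def parse(url: String): Option[MasterUrl] = url.trim match {
    case Pattern(host, port) => Some(MasterUrl(host, port.toInt))
    case _                   => None
  }
}
```

With this sketch, MasterUrl.parse("spark://master-host:7077") returns a value holding the host master-host and the port 7077 (Spark's default port for the standalone master), while a string with a space after the colon is rejected.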

Spark Worker

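Several worker nodes can be started on different machines in the cluster using the command below (from Spark 3.1 onward the script is named start-worker.sh; start-slave.sh is the older name):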

./sbin/start-slave.sh <master-spark-URL>

The master’s web UI is available by default at http://localhost:8080 on the master machine, or at http://server:8080 from another machine, where server is the master's hostname.

We will see these scripts in detail in the Spark Installation section.


Spark Driver: The Driver is the process that runs the main() function of the application and creates the SparkContext. It is a separate JVM process, responsible for analyzing, distributing, scheduling, and monitoring work across the worker nodes. Each application submitted to a cluster has its own Driver, and even when multiple applications run simultaneously on a cluster, their Drivers do not communicate with each other in any way. The Driver also hosts several components that are part of the application, such as:

  • SparkEnv 
  • DAGScheduler 
  • TaskScheduler 
  • SparkUI 

The Spark application which we want to run is instantiated within the Spark Driver.  
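
One way to see the Driver's scheduling components at work is to inspect the lineage the Driver records for an RDD, which the DAGScheduler later splits into stages at shuffle boundaries. The sketch below assumes Spark is on the classpath and runs in local mode; the object name `LineageSketch` and the sample words are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns the lineage description and the word counts.
  def run(): (String, Map[String, Int]) = {
    // local[2] runs the driver and two worker threads in one JVM (assumption for the demo).
    val conf = new SparkConf().setAppName("LineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("driver", "executor", "driver", "task"))
      // reduceByKey needs a shuffle, so the job is split into two stages.
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
      // toDebugString describes the chain of RDDs the driver has recorded.
      val lineage = counts.toDebugString
      (lineage, counts.collect().toMap)
    } finally {
      sc.stop()
    }
  }
}
```

Printing the first element of the result shows the chain of RDDs, including the shuffled RDD created by reduceByKey. While the application runs, the same job and its stages are also visible in the Spark UI hosted by the Driver.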

Spark Executor: The Driver program launches tasks that run on individual worker nodes. Each task operates on the subset (partition) of an RDD that is present on that node. The processes on the worker nodes that run these tasks are called executors, and the actual logic written in your application is executed by them. After it starts, the Driver program interacts with the Cluster Manager (YARN, Mesos, or Spark's default standalone manager) to acquire resources on the Worker nodes and then assigns tasks to the executors. Tasks are the basic units of execution.
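
The division of labor described above can be illustrated with a toy model in plain Scala. This is only a sketch using ordinary collections, not Spark APIs: the data is split into partitions, one task runs per partition, and the driver side combines the partial results. All names here (`PartitionedSumSketch`, `partition`, `sumTask`, `run`) are hypothetical.

```scala
object PartitionedSumSketch {
  // Split the data into roughly equal slices, one per simulated partition.
  def partition(data: Seq[Int], numPartitions: Int): Seq[Seq[Int]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val sliceSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(sliceSize).toSeq
  }

  // A "task": the unit of work applied to exactly one partition.
  def sumTask(part: Seq[Int]): Int = part.sum

  // The "driver" side: run one task per partition, then combine the partial results.
  def run(data: Seq[Int], numPartitions: Int): Int =
    partition(data, numPartitions).map(sumTask).sum
}
```

In real Spark the tasks run in parallel inside executor processes on different machines; here they run one after another in a single JVM, which keeps the sketch easy to follow.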

SparkSession and SparkContext: SparkContext is the heart of any Spark application. It can be thought of as a bridge from your program to the Spark environment and everything it has to offer. SparkContext is the entry point used to kick-start the application, and it can be used to create RDDs as shown below:

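  import org.apache.spark.{SparkConf, SparkContext}

  // master is a master URL string, for example "local[*]" or spark://HOST:PORT
  val conf = new SparkConf().setAppName("FirstSpark").setMaster(master)
  val sc = new SparkContext(conf)
  val data = Array(1, 2, 3, 4, 5)
  val distData = sc.parallelize(data)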

distData is the RDD created using the SparkContext.
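
Once an RDD such as distData exists, work is expressed as transformations and actions on it. The following is a small sketch of that, assuming Spark is on the classpath and using the local master "local[*]" so that it runs without a cluster; the object name `FirstSparkSketch` is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstSparkSketch {
  // Returns the sum of the doubled values and the doubled values themselves.
  def run(): (Int, Array[Int]) = {
    // "local[*]" is an assumption for the demo: driver and executors share one JVM.
    val conf = new SparkConf().setAppName("FirstSpark").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
      // map is a lazy transformation: nothing is computed yet.
      val doubled = distData.map(_ * 2)
      // reduce and collect are actions: each one triggers a job on the executors.
      (doubled.reduce(_ + _), doubled.collect())
    } finally {
      sc.stop()
    }
  }
}
```

Calling FirstSparkSketch.run() returns the total 30 together with the array 2, 4, 6, 8, 10.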

SparkSession is a simplified entry point into a Spark application, and it also encapsulates the SparkContext. SparkSession was introduced in Spark 2.0. Prior to this, Spark had different contexts for different use cases, such as SQLContext for SQL queries, HiveContext for running Spark on Hive, StreamingContext for streaming, and so on. SparkSession removes the confusion about which context to use, as it subsumes SQLContext and HiveContext. SparkSession is instantiated using a builder, and it is an important component of Spark 2.0.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local")
    .appName("SparkSessionExample")
    .getOrCreate()

In the interactive Spark Scala shell, the SparkSession and SparkContext are provided automatically by the environment (as spark and sc), so there is no need to create them manually. In standalone applications, however, we need to create them explicitly.
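
To make the relationship between the two entry points concrete, the sketch below builds a SparkSession in a standalone program and reaches the encapsulated SparkContext through it. It assumes Spark SQL is on the classpath and runs in local mode; the object name `SparkSessionSketch` is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  // Returns the application name seen by the SparkContext and a row count.
  def run(): (String, Long) = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("SparkSessionExample")
      .getOrCreate()
    try {
      // The SparkContext is reachable through the session; no separate context is created.
      val sc = spark.sparkContext
      // range(1, 6) produces a Dataset with the values 1 to 5 (the end is exclusive).
      val rows = spark.range(1, 6).count()
      (sc.appName, rows)
    } finally {
      spark.stop()
    }
  }
}
```

Here SparkSessionSketch.run() returns the application name SparkSessionExample and the row count 5.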

Spark Deployment Modes Cheat Sheet

| Mode | Driver | When To Use |
|---|---|---|
| Client Mode | The Driver runs on the machine from which the Spark job is submitted. | When the submitting machine is close to the cluster, so network latency is negligible. The chance of failure is higher, because the job depends on the network connection between the client and the cluster. |
| Cluster Mode | The Driver is launched on one of the machines in the cluster, not on the client machine from which the job is submitted. | When the submitting machine is far from the cluster. The chance of failure due to network issues is lower. |
| Standalone Mode | The Driver is launched on the machine where the master script is started. | Useful for development and testing; not recommended for production-grade applications. |

Conclusion

In this section we looked at the internals of Apache Spark. They matter because we will need to inspect many of these processes when we work with Spark in a production environment, and most of this understanding comes in handy while debugging and tuning our applications.

