Apache Spark Tutorial

Apache Spark Deployment

Introduction

We have already worked with the Spark shell and seen how to write transformations, but we also need to understand how to execute a Spark application. Let us look at the deployment of a sample Spark application in detail.

Spark applications can be deployed and executed on a cluster using the spark-submit shell command. Through its uniform interface, spark-submit can use any of the cluster managers, such as YARN, Mesos, or Spark's own standalone cluster manager, with no extra configuration needed for each of them separately.
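
For reference, the --master URL can take different forms depending on the cluster manager; the host names and ports below are placeholders:

local[*]               # run locally, using as many worker threads as there are cores
spark://host:7077      # connect to a Spark standalone cluster master
yarn                   # connect to a YARN cluster (location read from HADOOP_CONF_DIR or YARN_CONF_DIR)
mesos://host:5050      # connect to a Mesos cluster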

If the code depends on other projects, we need to package those dependencies together with the Spark code so that the dependent code is also distributed across the Spark cluster. To do this, we create an assembly jar (or “uber” jar) containing our Spark code and its dependencies, using a build tool such as SBT or Maven. We do not need to bundle the Spark and Hadoop jars into this “uber” jar; they can be listed as provided dependencies, since the cluster manager supplies them at runtime. Once the assembly jar is ready, we can submit it with spark-submit.
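
As an illustration, a minimal build.sbt for such an assembly might look like the sketch below. The project name, versions, and the sbt-assembly plugin line are assumptions; adapt them to your own project. The Spark dependency is marked "provided" so that it is left out of the uber jar:

// build.sbt (illustrative sketch; names and versions are assumptions)
name := "spark-deploy-example"
version := "0.1.0"
scalaVersion := "2.11.12"

// Marked "provided": the cluster supplies Spark at runtime,
// so it is excluded from the assembly ("uber") jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided"

// project/plugins.sbt would add the sbt-assembly plugin, for example:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
// The uber jar is then built with: sbt assembly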

A common spark-submit command would look like this:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Some of the commonly used options are listed below; a combined example follows the list:
  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap “key=value” in quotes.
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any
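
Putting these options together, a hypothetical submission to a YARN cluster in cluster mode could look like this (the jar path and the memory setting are illustrative, not taken from the example above):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=2g \
  /path/to/spark-examples_2.11-2.4.3.jar \
  100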

Explanation with example

Let us look at running an example Spark application. We will take the SparkPi application provided with the Spark examples that ship as part of the installation.

The Scala code looks like this:

package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}

Spark can also be used for compute-intensive tasks. This code estimates π by "throwing darts" at a circle: we pick random points in the square spanning (-1, -1) to (1, 1) and count how many fall inside the unit circle. The circle has area π and the square has area 4, so the fraction of points that land inside should be π / 4; multiplying that fraction by 4 gives our estimate.
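
To see the math in isolation, here is a minimal, non-distributed sketch of the same dart-throwing estimate in plain Scala (the object name LocalPi and the sample count are made up for illustration):

import scala.util.Random

object LocalPi {
  def main(args: Array[String]): Unit = {
    val n = 1000000                        // number of darts to throw
    val rng = new Random()
    val inside = (1 to n).count { _ =>
      val x = rng.nextDouble() * 2 - 1     // random point in [-1, 1] x [-1, 1]
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y <= 1                   // inside the unit circle?
    }
    println(s"Pi is roughly ${4.0 * inside / n}")
  }
}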

Executing Steps

First, locate the Spark examples jar that comes with the Spark installation. On my installation, it is at this location:

/usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar

The class is called org.apache.spark.examples.SparkPi. Open a terminal and execute the command below. It runs the SparkPi example and prints the result; the trailing argument 10 is the number of slices (partitions) passed to the program. Because we are just running locally, the job goes through Spark's default resource manager. Depending on your environment and the resource manager in use, many more options can be specified in the spark-submit command; for simplicity, I have used Spark's defaults.

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 10
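
If you want to be explicit about the master, you can add the --master flag. For example, this (hypothetical) variant runs the same job using all local cores:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[*]" \
  /usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 10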


You can see the output printed as:

Pi is roughly 3.1396071396071394

You can look at the full set of options that can be used with the spark-submit command in the Submitting Applications page of the official Apache Spark website.

Conclusion

In this module, we learned how to deploy a Spark application and looked at some of the different configuration parameters. We could not go through every parameter, but we covered the most important ones.
