Introduction
We have already worked with the Spark shell and seen how to write transformations. Now we need to understand how to execute a Spark application. Let us look at the deployment of a sample Spark application in detail.
Spark applications can be deployed and executed on a cluster using the spark-submit shell command. Through its uniform interface, spark-submit can work with any of the supported cluster managers, such as YARN, Mesos, or Spark's own standalone cluster manager, with no extra configuration needed for each of them separately.
If the code depends on other projects, we need to package those dependencies together with the Spark code so that the dependent code is also distributed across the Spark cluster. To package everything, we create an assembly jar (or "uber" jar) containing our Spark code and its dependencies. We can use SBT or Maven to create the assembly jar. We do not need to bundle the Spark and Hadoop jars in this "uber" jar; these can be listed as provided dependencies, since the cluster manager supplies them at runtime. Once the assembly jar is ready, we can spark-submit it.
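As a minimal sketch of the packaging step, assuming an SBT project with the sbt-assembly plugin enabled, a build.sbt along these lines marks Spark as a provided dependency so that it stays out of the uber jar (the project name and versions are illustrative):

// build.sbt -- illustrative settings for an assembly build
name := "spark-pi-app"
version := "0.1"
scalaVersion := "2.11.12"

// "provided" keeps spark-sql out of the assembly jar;
// the cluster manager supplies the Spark classes at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided"

Running sbt assembly then produces the uber jar under the project's target directory.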
A common spark-submit command would look like this:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

--class: the entry point for the application (for example, org.apache.spark.examples.SparkPi).
--master: the master URL for the cluster (for example, local, yarn, or spark://host:7077).
--deploy-mode: whether to launch the driver on one of the worker nodes (cluster) or locally as an external client (client); the default is client.
--conf: an arbitrary Spark configuration property as a key=value pair.
application-jar: the path to the assembly jar containing the application and all its dependencies.
application-arguments: arguments passed to the main method of the main class, if any.
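As a filled-in illustration (the class name, jar path, and values here are hypothetical), a local run with an explicit configuration property might look like:

./bin/spark-submit \
  --class com.example.MyApp \
  --master "local[4]" \
  --conf spark.sql.shuffle.partitions=8 \
  /path/to/my-app-assembly.jar 100

Here local[4] asks Spark to run locally with four worker threads.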
Explanation with example
Let us look at running an example Spark application. We will take the SparkPi application provided in the Spark examples that ship with the installation.
The Scala code looks like this:
package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}
Spark can also be used for compute-intensive tasks. This code estimates π by "throwing darts" at a circle: we pick random points in the square from (-1, -1) to (1, 1) and count how many fall inside the unit circle. Since the circle has area π and the square has area 4, the fraction of points inside the circle should be π / 4, so multiplying that fraction by 4 gives our estimate.
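To see the arithmetic without the Spark machinery, here is a minimal plain-Scala sketch of the same Monte Carlo estimate (the object name and point count are arbitrary):

import scala.math.random

// A single-machine version of the dart-throwing estimate
object LocalPi {
  def main(args: Array[String]): Unit = {
    val n = 1000000 // number of random points; arbitrary
    val inside = (1 to n).count { _ =>
      val x = random * 2 - 1 // x in [-1, 1)
      val y = random * 2 - 1 // y in [-1, 1)
      x * x + y * y <= 1     // point falls inside the unit circle
    }
    println(s"Pi is roughly ${4.0 * inside / n}")
  }
}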
Execution Steps
Locate the Spark examples jar that comes with the Spark installation. On my installation, it is at this location:
/usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar
The class is called SparkPi. Open the command prompt and execute the command below. This should run the SparkPi example and compute the output. The job runs through Spark's default resource manager, since we are just running locally. There are multiple options that can be specified in the spark-submit command depending on the environment you are operating in and the resource manager used. For simplicity, I have used Spark's default.
spark-submit --class org.apache.spark.examples.SparkPi /usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 10
You should see output similar to the following (the exact value varies from run to run, since the points are random):
Pi is roughly 3.1396071396071394
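For comparison, submitting the same example to a YARN cluster in cluster deploy mode might look like the sketch below; the resource settings are illustrative and depend on your cluster, and this assumes HADOOP_CONF_DIR points at your Hadoop configuration so spark-submit can find the ResourceManager:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2g \
  --num-executors 4 \
  /usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 100

In cluster mode the driver runs inside the cluster, so the "Pi is roughly ..." line appears in the driver's YARN logs rather than in your terminal.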
The full list of options that can be used with the spark-submit command is documented on the official Apache Spark website.
Conclusion
In this module we learned how to deploy a Spark application and looked at some of the different configuration parameters. We could not go through every parameter, but we covered the most important ones.