
Apache Spark Tutorial

Advanced Apache Spark Internals and Spark Core

Introduction

In this section we will look at some of the advanced concepts of Apache Spark, such as the RDD (Resilient Distributed Dataset), which is the building block of Spark processing. We will also look at transformations and actions, the basic units of Spark processing.

We have already seen how to create a SparkContext and RDDs from it. Let us now go deeper into how a Spark program is executed and what its internals look like.

Spark defines an RDD interface with a set of properties that must be implemented by every RDD type. These properties are mainly needed so that the Spark engine knows an RDD's dependencies and data locality. There are five main properties, which correspond to the following methods (a short sketch of inspecting them from the shell follows the list):

  • partitions() - returns an array of partition objects that make up the parts of the distributed dataset.
  • iterator(p, parentIters) - computes partition p given iterators over each of its parent partitions. This method is used internally by Spark while executing actions.
  • dependencies() - returns a sequence of dependency objects. The scheduler uses this to connect an RDD to the RDDs it depends on.
  • partitioner() - returns a Scala Option containing a Partitioner object if the RDD has a partitioning function (for example, a hash or range partitioner) associated with it.
  • preferredLocations(p) - returns information about the data locality of partition p.
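To make these methods concrete, here is a minimal sketch (not part of the original example) of inspecting the five properties from the spark-shell, where sc is the SparkContext and the data is purely illustrative:

val rdd = sc.parallelize(1 to 100, 4)        // a small RDD split into 4 partitions
rdd.partitions.length                        // partitions(): array of partition objects, size 4 here
rdd.dependencies                             // dependencies(): sequence of dependency objects
rdd.partitioner                              // partitioner(): Option[Partitioner], None for this RDD
rdd.preferredLocations(rdd.partitions(0))    // preferredLocations(p): locality hints for partition 0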

There are two types of functions defined on RDDs: transformations and actions. Transformations are functions that return another RDD, while actions are functions that return something other than an RDD. As we already know, Spark evaluates lazily: only when an action is invoked does Spark evaluate all the transformations and finally produce the output.
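As a small illustration of lazy evaluation (a sketch, not part of the original example), nothing below is computed until the action count() is called:

val nums = sc.parallelize(1 to 10)
val squares = nums.map(n => n * n)   // transformation: returns a new RDD, nothing runs yet
val total = squares.count()          // action: triggers evaluation and returns the count as a Long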

Also, there are two types of transformations in Spark: narrow transformations and wide transformations.

The difference between narrow and wide transformations has significant ramifications for how tasks are executed and how well they perform. Narrow transformations are operations that depend on only one, or a known set, of partitions in the parent RDD, so they can be evaluated without any information about the other partitions. Examples of narrow transformations are map, filter and flatMap. Wide transformations, by contrast, cannot be executed on arbitrary rows; they need the data to be partitioned in a particular way. Examples of wide transformations are sort, reduceByKey, groupByKey and join, all of which require repartitioning (a shuffle) of the data.

Narrow Transformation and Wide Transformation
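As a rough sketch of the difference (the pair data here is made up for illustration), filter is a narrow transformation because each output partition depends on exactly one parent partition, while reduceByKey is a wide transformation that forces a shuffle:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val narrow = pairs.filter { case (_, v) => v > 1 }   // narrow: evaluated partition by partition, no shuffle
val wide   = pairs.reduceByKey(_ + _)                // wide: data must be repartitioned by key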

In Spark’s execution paradigm, a Spark application does not evaluate anything to produce an output; it only builds an execution graph called a DAG, until the driver program calls an action and a Spark job is launched. Each job consists of stages, which are the steps in the transformation of the data needed to materialize the final RDD. Each stage is a set of tasks, the smallest units of execution, which run in parallel on the executors launched on the cluster. Every task in a stage executes the same code on a different piece of data, and one task cannot run on more than one executor. Each executor is allocated a number of slots for running tasks and may run many tasks in parallel over its lifetime. The number of tasks in each stage is the same as the number of partitions in the output RDD of that stage.

The diagram below shows the relationship between the jobs, stages and tasks of an application.

Jobs, Stages and Tasks for an application.
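A quick way to relate tasks to partitions from the shell (a sketch with illustrative in-memory data): the number of tasks in a stage equals the partition count of that stage's output RDD.

val data = sc.parallelize(1 to 1000, 8)   // an RDD with 8 partitions
data.getNumPartitions                     // 8
data.count()                              // launches one job whose single stage runs 8 tasks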

Let us see an example program to calculate the frequency of each word in a text file.  

scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> val lines = sc.textFile("/Users/home/Downloads/Spark.txt")
lines: org.apache.spark.rdd.RDD[String] = /Users/home/Downloads/Spark.txt MapPartitionsRDD[1] at textFile at <console>:25

scala> val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26

scala> counts.collect()
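collect() brings the entire result back to the driver as an Array[(String, Int)]. For larger files it is usually safer to fetch only the most frequent words; a sketch reusing the counts RDD from above:

val top10 = counts.top(10)(Ordering.by { case (_, n) => n })   // ten most frequent words
top10.foreach(println)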


Looking at the Spark Web UI, we can see the job and stages created for the application.

Spark Web UI showing the job and stages for the application

The details of each stage and its DAG can be viewed by clicking on the stage links, as shown below:

DAG and details for each stage
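The same lineage information that backs this DAG view can also be printed directly from the shell with the RDD's toDebugString method, whose indentation marks the shuffle boundary between stages:

counts.toDebugString   // prints the lineage of the counts RDD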

Conclusion

In this section we understood the lowest-level abstraction in Spark, i.e. RDDs. This will help us write our application's processing in terms of transformations and actions.
