top

Search

Apache Spark Tutorial

IntroductionIn this section we will look at a concrete example of an RDD transformation function and try to see the output by executing it on the Spark shell.We have seen above the functions we can use with RDDs. These could be Transformations which produce another RDD or Actions which produce anything other than RDDs and send the result to the Driver or write to the disk or stable storage.Implementations of RDD Transformations and Actions with an example:Let us look at a concrete example of executing RDD transformation and action on real data. There are many examples available in Scala, Python and Java which are readily available with Apache Spark installation and they can be executed on the Spark shell. The examples are available in Spark Github at: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examplesProcedure for executing [the example]: All of these examples can be executed by submitting the examples.jar provided with Spark installation. We can also execute these interactively on the Spark shell. Let us execute a simple one Word Count example on the Spark shell to understand in detail.Open Spark-Shell: The first step is to open the spark-shell on your machine where Spark is installed. Please execute the following command on the command line> spark-shellThis should open the Spark shell as below:Create an RDD: The next step is to create an RDD by reading a text file for which we are going to count the words.I have a file called “Spark.txt”. You can similarly have any .txt file and note the location. The first step is to create an RDD by reading the file as below:Execute Word count Transformation:  The next step is to execute the steps of the word count transformationsEach of the lines in the file is split into words using flatMap RDD transformation. flatMap applies a function that returns a sequence for each element in the list, and it flattens the results into the original list.Each word is read and key-value pairs are used to create the map transformation. This assigns the value ‘1’ to each of the word-keys.In the last step the values of matching keys are added to get the final count of each of the words using reduceByKey function.Please note that I have executed .collect() step only for demonstration purpose to show the intermediate to get the understanding better. This is not required in actual programming.Current RDD: In our example above we have different RDDs at the different steps. If we want to know about the current RDD, we can execute the following command and get more details about the RDD.> counts.toDebugStringThis gives the whole dependencies of the RDD for debugging purposes.Caching the Transformations: If we look at the 3 step execution of our word count example in detail, we note that each time I executed .collect(), the execution started from reading the file. So every time an action was called, it re-computed all the steps in my execution which is not what we would like. So we can avoid this by persisting or caching the RDD. This can be done by persist or cache methods. This caches the RDD in memory after the action is called and then the next iterative step will not re-compute the same steps but will use the cache and will perform better.Applying the Action: As we already know, all Spark transformations are executed only when an action is called. This results into the actual computation of the whole dependencies and gets the result for the computation.We can execute the following to save our output.Checking the Output: We can check the output of our program by opening another terminal and running the command:> ls -l /Users/home/Downloads/output/Output: The output can be seen by running the below command on the 2 part files created as the output of our program.This is not the full output as my screen could not capture the full terminal window as results span across windows due to bigger input file.The output also contains _SUCCESS file which shows that the program execution is completed successfully. This comes from the similar MapReduce concept in Hadoop.ConclusionWe saw above how to work with a transformation on the Spark shell. Working with other set of transformations and actions is very similar.
logo

Apache Spark Tutorial

Apache Spark Programming with RDD

Introduction

In this section we will look at a concrete example of an RDD transformation function and try to see the output by executing it on the Spark shell.

We have seen above the functions we can use with RDDs. These could be Transformations which produce another RDD or Actions which produce anything other than RDDs and send the result to the Driver or write to the disk or stable storage.

Implementations of RDD Transformations and Actions with an example:

Let us look at a concrete example of executing RDD transformation and action on real data. There are many examples available in Scala, Python and Java which are readily available with Apache Spark installation and they can be executed on the Spark shell. The examples are available in Spark Github at: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

Procedure for executing [the example]: All of these examples can be executed by submitting the examples.jar provided with Spark installation. We can also execute these interactively on the Spark shell. Let us execute a simple one Word Count example on the Spark shell to understand in detail.

Open Spark-Shell: The first step is to open the spark-shell on your machine where Spark is installed. Please execute the following command on the command line

> spark-shell

This should open the Spark shell as below:

Spark Code

Create an RDD: The next step is to create an RDD by reading a text file for which we are going to count the words.

I have a file called “Spark.txt”. You can similarly have any .txt file and note the location. The first step is to create an RDD by reading the file as below:

Create an RDD

Execute Word count Transformation:  The next step is to execute the steps of the word count transformations

  • Each of the lines in the file is split into words using flatMap RDD transformation. flatMap applies a function that returns a sequence for each element in the list, and it flattens the results into the original list.

Execute Word count Transformation

  • Each word is read and key-value pairs are used to create the map transformation. This assigns the value ‘1’ to each of the word-keys.

Execute Word count Transformation

  • In the last step the values of matching keys are added to get the final count of each of the words using reduceByKey function.

Execute Word count Transformation

Please note that I have executed .collect() step only for demonstration purpose to show the intermediate to get the understanding better. This is not required in actual programming.

Current RDD: In our example above we have different RDDs at the different steps. If we want to know about the current RDD, we can execute the following command and get more details about the RDD.

> counts.toDebugString

This gives the whole dependencies of the RDD for debugging purposes.

Current RDD

Caching the Transformations: If we look at the 3 step execution of our word count example in detail, we note that each time I executed .collect(), the execution started from reading the file. So every time an action was called, it re-computed all the steps in my execution which is not what we would like. So we can avoid this by persisting or caching the RDD. This can be done by persist or cache methods. This caches the RDD in memory after the action is called and then the next iterative step will not re-compute the same steps but will use the cache and will perform better.

Current RDD Caching the Transformations

Current RDD Caching the Transformations

Applying the Action: As we already know, all Spark transformations are executed only when an action is called. This results into the actual computation of the whole dependencies and gets the result for the computation.

We can execute the following to save our output.

Applying the Action

Checking the Output: We can check the output of our program by opening another terminal and running the command:

> ls -l /Users/home/Downloads/output/

Applying the Action

Output: The output can be seen by running the below command on the 2 part files created as the output of our program.

Spark Code

This is not the full output as my screen could not capture the full terminal window as results span across windows due to bigger input file.

The output also contains _SUCCESS file which shows that the program execution is completed successfully. This comes from the similar MapReduce concept in Hadoop.

Conclusion

We saw above how to work with a transformation on the Spark shell. Working with other set of transformations and actions is very similar.

Leave a Reply

Your email address will not be published. Required fields are marked *

Comments

alvi

I feel very grateful that I read this. It is very helpful and very informative, and I really learned a lot from it.

alvi

I would like to thank you for the efforts you have made in writing this post. I wanted to thank you for this website! Thanks for sharing. Great website!

alvi

I feel very grateful that I read this. It is very helpful and informative, and I learned a lot from it.

sandipan mukherjee

yes you are right...When it comes to data and its management, organizations prefer a free-flow rather than long and awaited procedures. Thank you for the information.

liana

thanks for info