Spark was developed by Matei Zaharia in 2009 as a research project at UC Berkeley's AMPLab, which focused on big data analytics. The fundamental goal behind developing the framework was to overcome the inefficiencies of MapReduce. Even though MapReduce was a huge success and won wide acceptance, it could not be applied to a wide range of problems. MapReduce is not efficient for multi-pass applications that require low-latency data sharing across multiple parallel operations. Many data analytics applications fall into this category, including iterative algorithms (such as machine learning and graph processing) and interactive data mining.
MapReduce does not fit such use cases well, because each pass runs as a distinct job, so intermediate data has to be written back to disk and read from disk again between jobs.
Spark offers a much better programming abstraction called the RDD (Resilient Distributed Dataset), which can be kept in memory between queries and cached for repeated use. An RDD is a read-only collection of objects partitioned across different machines, and it is fault-tolerant: a lost partition can be recomputed from scratch in case of a process or node failure. Although RDDs are not a general shared-memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other. We will see the concepts of RDDs in detail in the following sections and understand how Spark uses them to process data at such speed.
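To make this concrete, here is a minimal sketch in Scala showing how caching an RDD lets a second query reuse data already in memory instead of re-reading it from disk. The file name errors.log and the log format are illustrative assumptions, not part of the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCacheExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddCacheExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD from a text file (hypothetical path for illustration).
    val lines = sc.textFile("errors.log")

    // Transformations are lazy: nothing executes until an action is called.
    val errors = lines.filter(_.contains("ERROR"))

    // cache() keeps the filtered partitions in memory after the first action,
    // so subsequent queries reuse them instead of re-reading from disk.
    errors.cache()

    val total    = errors.count()                                // first action: reads from disk
    val timeouts = errors.filter(_.contains("timeout")).count()  // served from the in-memory cache

    println(s"errors=$total, timeouts=$timeouts")
    sc.stop()
  }
}
```

Note that if a node holding cached partitions fails, Spark does not need a replica: it simply re-applies the recorded transformations (the lineage) to the original input to rebuild the lost partitions.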