avatarAmar Balu

Summary

The provided web content outlines the architecture and advantages of Apache Spark as a fast and unified analytics engine for large-scale data processing, contrasting it with traditional MapReduce and HDFS.

Abstract

Apache Spark is introduced as an open-source, unified analytics engine designed for rapid processing of large-scale data. It is highlighted for its speed, attributed to in-memory computation and a Directed Acyclic Graph (DAG) scheduler, which significantly outperforms the disk-based MapReduce processing model. Spark supports multiple programming languages and libraries, and it can run on various cluster managers, accessing data from diverse storage systems. The text emphasizes Spark's real-time data processing capabilities and its suitability for iterative algorithms, which are bottlenecks in MapReduce due to disk I/O overhead. Use cases for Spark include e-commerce data analysis, banking fraud detection, real-time analytics for ride-sharing services, and personalized advertising. The article also clarifies the distinction between Spark as a data processing layer and HDFS as a data storage layer, noting that Spark can utilize HDFS for storage. The comparison between Spark's in-memory processing and MapReduce's disk-based processing details the time delays associated with disk storage, such as seek time, rotational delay, and transfer time, advocating for the efficiency of in-memory computing for immediate results in interactive and iterative computations.

Opinions

  • The author suggests that Spark's in-memory processing is superior to MapReduce's disk-based processing, particularly for applications requiring speed and real-time analytics.
  • Spark is portrayed as a versatile tool that enhances the efficiency of data processing across various industries and applications.
  • The article implies a preference for Spark over traditional Hadoop components for modern data processing needs.
  • The author posits that Spark's ability to handle large volumes of data at high speeds is a significant advantage, making it ideal for tasks like fraud detection and real-time pricing.
  • There is an emphasis on the importance of Spark's support for multiple languages and libraries, which contributes to its flexibility and ease of use for developers.
  • The text encourages readers to support small publishers and enhance their growth, indicating a value placed on community and knowledge sharing within the tech industry.

Introduction to Spark Architecture

It is an Open Source, Unified Analytics Engine for Large Scale Data Processing.

img src: https://www.databricks.com/glossary/what-is-spark-streaming

Why Spark?

Although we have MapReduce for Handling large Volume of Data. The process involves in Reading and Writing the Data is Little time Consuming.

But Spark helps us to handle data in Lightening Speed and also it can be used for Real Time Data Processing.

The Speed of Spark is due to the In-Memory Computation method and also it uses DAG Scheduler. (Directed Acyclic Graph).

Features of Spark

  • It supports various Languages like Java, Sql, Python.
  • It supports various Libraries like Spark SQL, Spark Streaming, MLlib, GraphX.
  • Spark can use its standalone cluster Manager to run on Apache Mesos, Hadoop Yarn or Kubernetes.
  • It can Access data stored in HDFS, S3 and any other Local Storage
  • It uses Lazy Evaluation Process as it is the reason for speed of spark.

Difference Between Spark and HDFS?

HDFS -> It acts as a Data Storage Layer. Hadoop is a combination of HDFS and MapReduce Spark -> It acts as a Data Processing Layer. Spark can use HDFS as a data Storage Layer.

UseCases

Following are the places, where spark is largely used

Ecommerce -> for Data Analysing their sales and inventory effectively. Banking -> to Analyse data for Fraud Detection Ola & Uber -> to analyse real time data for calculating price and availability of drivers Personalised Ads -> Based on user’s browser history

Spark vs MapReduce

There are two types of Storage system available generally. They are

  1. In-Memory
  2. Disk

MapReduce follows Disk Based Processing System in which the data are stored in the Hard Disk.

Example

If you are trying to process a data, it actually follows up 2 phases. i.e Map & Reduce. In Map, data is fetched form the disk and it is processed and transformed into partition as an intermediate Result and stored in disk. In Reduce, the intermediate Result is fetched from the disk again and it is processed as input and the final result is again stored in the disk.

So lets discuss about the different time delays which happen’s during this Entire process.

Seek Time :

The Time taken by the Read/Write head to move from Current position to New Position.

Rotational Delay :

It is the time taken by the disc to rotate so that it brings the Read/Write head point to the beginning of the data.

Transfer Time :

It it the time taken to Read/Write the data From and To the hard disk.

Access Time :

It denotes the total time Delay between the Request Submitted and the Response Received.

Access Time = Seek Time + Rotational Delay + Transfer Time

MapReduce is useful to read data from large files in small quantity as the number of head movement will be less and the time consumption will be handles effectively.

Why use In-Memory and Why not use Disk Storage?

  • If data is stored in-memory, random access is possible whereas it is time consuming in disk as it only allows sequential storage
  • Real time data processing is fast, so here in-memory is used as we need immediate results to get processed immediately.
  • Immediate results includes two category’s as Interactive Results and Iterative Results are stored in-memory and can be used for further computation

Example Scenario for Iterative Results in Mapreduce and Spark:

If we need to process the data iteratively, in MapReduce after each map and reduce process the data need to be stored in Disk and need to be fetched again which is time consuming.

Img src: https://www.macrometa.com/event-stream-processing/apache-spark-vs-hadoop

In Spark, we use in-memory storage so the process does not take any extra memory and time for storage and retrieval as it uses lighting speed and the process will be efficient.

Example Scenario for Interactive Results in Mapreduce and Spark:

In MapReduce, whenever any query need to be processed, then the data need to fetched from the disk for each time and to be processed. It is time consuming process when compared to spark.

Img Src : https://www.projectpro.io/article/working-with-spark-rdd-for-fast-data-processing/273

In Spark, the data is stored in the distributed memory. The data is loaded into the distributed memory from the disk as a HDFS read process and it happens once. So even for multiple queries the data’s are fetched from this distributed memory and can be efficient.

Img src : https://www.projectpro.io/article/working-with-spark-rdd-for-fast-data-processing/273

Types of Processing

Batch Processing

Data collected over a period of time and is stored in the form of batch also the result cannot be expected immediately. As a result this process in time consuming and does not need any high speed processing frameworks.

Real Time Processing

Instant Responses and high Speed processing. Eg: Banking for Fraudulent Detection.

I hope this article would have given you an idea about why spark and what’s the difference between spark and MapReduce.

Happy learning 😄

Kindly help us to grow by following this publication

Data Science
Spark
Data Engineering
Mapreduce
Recommended from ReadMedium