Big Data Analytics: A Comparative Evaluation of Apache Hadoop and Apache Spark

Page 1

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 11 | Nov 2023

p-ISSN: 2395-0072

www.irjet.net

Big Data Analytics: A Comparative Evaluation of Apache Hadoop and Apache Spark 1Sukhpreet Singh, 2Jaswinder Singh, 3Sukhpreet Singh 1Assistant Professor ,Faculty of Computing, Guru Kashi University, Punjab ,India

2Assistant Professor ,Faculty of Computing, Guru Kashi University, Punjab ,India 3Assistant Professor ,Faculty of Computing, Guru Kashi University, Punjab ,India ---------------------------------------------------------------------***--------------------------------------------------------------------better suited for particular big data analytics use cases. Abstract: Big Data Analytics has become essential for

Therefore, a comparative study of these two tools is essential to identify their strengths and weaknesses and to determine which tool is better suited for a particular big data analytics task. This paper aims to provide a comparative study of Apache Hadoop and Apache Spark for big data analytics. The paper will first provide an overview of big data analytics and its importance. Then, it will introduce Apache Hadoop and Apache Spark and compare their architecture, processing speed, scalability, and ease of use. The paper will also discuss the strengths and weaknesses of each tool and provide a use case analysis to determine which tool is better suited for different big data analytics tasks [3, 18].

organizations to extract valuable insights and make informed decisions. Apache Hadoop and Apache Spark are two prominent frameworks widely used for Big Data processing and analytics. This research paper aims to provide a comparative evaluation of Apache Hadoop and Apache Spark in terms of their capabilities, performance, scalability, and ease of use. By analyzing various aspects such as data processing models, fault tolerance, programming paradigms, and ecosystem tools, we aim to identify the strengths and weaknesses of each framework. The findings from this study will help organizations in selecting the appropriate framework for their specific Big Data analytics requirements.

2. Literature review

Keywords: Big Data Analytics, Apache Hadoop, Apache Spark, Comparative Evaluation, Data Processing, Performance, Scalability, Fault Tolerance, Ecosystem Tools.

This is a critical section of any research paper that provides an in-depth analysis of the existing literature and research studies related to the topic. In this paper, the literature review section will discuss the various studies conducted on Apache Hadoop and Apache Spark Y for big data analytics. The section will aim to provide a comprehensive understanding of the two technologies and the comparative analysis conducted by previous researchers. Apache Hadoop and Apache Spark Y are two of the most widely used big data analytics tools. Apache Hadoop is a popular open-source framework used for distributed storage and processing of large data sets across clusters of computers. Hadoop has been widely used in various industries, including healthcare, finance, and retail, for storing and analyzing large volumes of data. On the other hand, Apache Spark Y is a fast and efficient open-source big data processing engine that is known for its in-memory data processing capabilities. Spark Y has become increasingly popular in recent years due to its ability to process data faster than Hadoop.

1. Introduction The rapid growth of the internet and the advancements in the fields of technology has led to a massive explosion of data, commonly known as big data. With the increasing volume, variety, and velocity of data, traditional data processing techniques have become inefficient and insufficient to handle the processing and analysis of big data. Big data analytics has emerged as a promising solution to address the challenges posed by big data. Big data analytics involves the application of various techniques and tools to extract meaningful insights from large and complex datasets. Two of the most widely used big data analytics tools are Apache Hadoop and Apache Spark. Both tools are opensource and designed to handle large-scale data processing tasks. Apache Hadoop is a distributed storage and processing framework that uses the Hadoop Distributed File System (HDFS) and MapReduce programming model to store and process large datasets. On the other hand, Apache Spark is a distributed computing framework that offers in-memory processing and uses Resilient Distributed Datasets (RDDs) to process data [1, 2].

Several studies have been conducted on the comparative analysis of Apache Hadoop and Apache Spark Y for big data analytics [4]. The researchers compared the performance of Hadoop and Spark Y in processing big data. The study found that Spark Y performed significantly better than Hadoop in terms of data processing speed and efficiency [5]. The researchers compared the two tools' performance in processing large-scale machine learning algorithms. The study found that Spark Y outperformed Hadoop in terms of speed and accuracy. However, some studies have also

Despite the similarities, there are significant differences between Apache Hadoop and Apache Spark in terms of their architecture, processing speed, scalability, and ease of use. These differences have led to debates about which tool is

© 2023, IRJET

|

Impact Factor value: 8.226

|

ISO 9001:2008 Certified Journal

|

Page 33


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.