Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]
IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3
A Study of Big Data Analytics using Apache Spark
with Python and Scala
2020 3rd International Conference on Intelligent Sustainable Systems (ICISS) | 978-1-7281-7089-3/20/$31.00 ©2020 IEEE | DOI: 10.1109/ICISS49785.2020.9315863
Yogesh Kumar Gupta
Assistant Professor, Department of Computer Science
Banasthali Vidyapith
gyogesh@banasthali.in

Surbhi Kumari
M.Tech (CSE) Research Scholar
Banasthali Vidyapith
surbhiroy25july@gmail.com
Abstract— Every day, humans generate data at a rate of about 2.5 quintillion bytes through sources such as Instagram, Facebook, Twitter, and Google, with high volume, high velocity, and high variety. When this huge, fast-moving volume of data is handled by traditional approaches, processing becomes inefficient and time-consuming. Apache Spark, an open-source in-memory cluster computing system, is therefore used for fast processing. This paper presents a brief study of Big Data Analytics and Apache Spark, covering the characteristics (7V's) of big data, tools and application areas for big data analytics, and the Apache Spark Ecosystem, including its components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. It also presents a comparative study of the Python and Scala programming languages across many parameters in the context of Apache Spark, to help identify which of the two is suitable for Apache Spark technology. The result is that both Python and Scala are suitable for Apache Spark; however, the choice of language depends on the features that best suit the needs of the project, since each has its own advantages and disadvantages. The main purpose of this paper is to make it easier for programmers to select a programming language for Apache Spark based on their project.
Keywords— Big Data, Apache Spark, Cluster Computing, Python, Scala.
I. INTRODUCTION
Big Data is a large set of data that can be in structured, semi-structured, or unstructured form and that is gathered from a variety of sources such as social media, cell phones, healthcare, and e-commerce. John Mashey coined the term Big Data in the 1990s, and it became popular in the 2000s. Tools and techniques of big data analytics, such as Apache Hadoop, MapReduce, Apache Spark, NoSQL databases, and Apache Hive, are used to process and manage massive amounts of big data. Collecting and processing huge amounts of data helps organizations gain a better understanding of it, and it also helps them find the information that matters most for future business decisions. There are three types of Big Data.
Structured Data:- Data that is already stored in databases in an ordered manner is called structured data. Nowadays, at least 20% of all data is structured; it is generated by sensors, weblogs, machines, humans, etc. Examples of structured data are records in a DBMS such as MySQL, spreadsheets, etc.
Semi-structured Data:- Datasets that are partly structured and partly unstructured are called semi-structured data, which is why developers find them difficult to categorize. Semi-structured data can also be handled through the Hadoop system. Examples of semi-structured data are JSON documents, BibTeX files, CSV files, XML, etc.
Unstructured Data:- Unstructured data has no fixed format and cannot be stored in row-column form. It can only be handled through the Hadoop system. At least 80% of the data in the world is unstructured. Examples of unstructured data are images, audio, video, text, PDFs, media posts, Word documents, log files, e-mails, etc.
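To make the distinction concrete, the small standard-library Python sketch below (toy data of my own) parses structured CSV, where every record has the same columns, and semi-structured JSON, where fields may be missing:

    import csv, io, json

    # Structured: fixed columns that map directly onto database rows.
    csv_text = "id,name,age\n1,Asha,29\n2,Ravi,34\n"
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    print(rows[0]["name"])                 # Asha

    # Semi-structured: self-describing fields without a rigid schema,
    # so code must tolerate absent keys.
    record = json.loads('{"id": 3, "name": "Meera", "tags": ["spark"]}')
    print(record.get("age", "missing"))    # missing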
Hadoop is one of the popular open-source, scalable, fault-tolerant platforms for large-scale distributed batch processing on clusters of commodity servers. It was developed to cope with the failures that commonly occur during execution in a distributed system. However, its performance compared with newer technologies is poor, since data is accessed from disk for processing. Because Hadoop provides fault tolerance, organizations do not need expensive products for processing tasks on large data sets. There are two key Hadoop building blocks: the Hadoop Distributed File System (HDFS), which can accommodate large datasets, and a MapReduce engine that computes results in batches. MapReduce is a distributed programming model for processing massive datasets across a large cluster. It has two functions, Map and Reduce, which help utilize the available resources for parallel processing of large data. It is used for batch processing over persistent storage; however, MapReduce was not developed for real-time processing.
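The two functions can be imitated in a few lines of plain Python. This is a sketch of the programming model only, not of Hadoop's distributed implementation; the grouping ("shuffle") step between the two phases is normally performed by the framework across machines.

    from collections import defaultdict

    documents = ["spark makes big data simple", "big data needs big tools"]

    # Map: each input record independently emits (key, value) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: the framework groups the pairs by key between the two phases.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: each key's values are aggregated independently, in parallel.
    totals = {word: sum(counts) for word, counts in groups.items()}
    print(totals["big"])   # 3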
Apache Spark is a powerful open-source parallel processing platform, flexible and user friendly, that is very appropriate for storing and performing big data analytics. It can run on vast cloud clusters as well as on a small cluster, and even locally on a student's computer with smaller datasets. Providers such as AWS and Google Cloud support it. With the RDD (resilient distributed dataset) abstraction, Spark can quickly perform processing tasks on very large data sets because they are stored in memory. The Apache Spark framework consists of several dominant components: Spark Core and the upper-level libraries Spark SQL, Spark Streaming, Spark MLlib, GraphX, and SparkR, which together support a wide range of workloads including batch processing, machine learning, interactive queries, and stream processing. Apache Spark ships with language-integrated APIs in SQL, Scala, Java, Python, and R. The major functions are accomplished in Spark Core, and the other components are tightly integrated with it, providing one unified environment. Spark is more efficient than other technologies, especially for iterative or interactive applications.
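As a minimal illustration of this in-memory model, the sketch below builds an RDD, caches it, and runs two passes over it; the second pass reuses the cached partitions instead of recomputing them from source. It assumes a local pyspark installation, and the data is illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # An RDD is partitioned across the cluster; cache() keeps the partitions
    # in memory so repeated (iterative) passes avoid rereading from source.
    numbers = sc.parallelize(range(1_000_000)).cache()

    total = numbers.sum()                                  # first pass fills the cache
    evens = numbers.filter(lambda n: n % 2 == 0).count()   # second pass reuses memory
    print(total, evens)

    spark.stop()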
A. Characteristics of Big Data

Fig. 1. 7Vs of Big Data

• Volume:- Big data implies enormous volumes of data (terabytes, petabytes). Data is created by various sources, including the internet explosion, social media, healthcare, the Internet of Things, e-commerce, and other systems, and needs large storage, so the volume of data is a big challenge.
• Velocity:- Velocity refers to low latency, or how fast data is generated by multiple sources, viz. social media data, healthcare data, and retail data. During 2020, every person generated about 1.7 MB of data every second.
• Variety:- Variety refers to the various types of data, which can be structured, unstructured, or semi-structured, existing in different forms such as text, e-mails, tweets, log files, user reviews, photos, audio, video, and sensor data. Example:- a high-variety data set would be the CCTV audio and video files produced at different places in a city.
• Veracity:- Veracity refers to noise and abnormality in big data. Not all data will be 100% correct when dealing with high volume, velocity, and variety, so there can be dirty data. Anyone who wants to extract meaningful insights from data needs to clean it up to minimize noise. Big data can only benefit applications when the data is meaningful and reliable; therefore, data cleansing is necessary so that inaccurate and unreliable information can be filtered out. Example:- a data set of high veracity would come from a medical procedure or trial.
• Validity:- The validity of the data refers to the accuracy and correctness of the data used to obtain results in the form of information. It is very important for making decisions.
• Volatility:- The volatility of big data concerns how long stored data remains useful. Data that is valid right now might not be valid a few minutes or a few days later.
• Value:- The value of data is the most important element of the 7V's of big data. It is not just the amount of data that individuals store or process; in reality, it is the amount of precious, accurate, and trustworthy information that needs to be stored, processed, and analyzed to find insights.

B. Challenges in Big Data Analytics

Many computational methods work well with small data but do not work well for data generated with high volume and velocity. Traditional tools are not efficient enough to process big data, and the complexity of doing so is very high. The following challenges arise when big data is analyzed.
1) Data Heterogeneity and Incompleteness. A major problem for researchers is how to include all the data from various sources to discover patterns and trends. Unstructured and semi-structured data remain difficult to analyze, so data must be carefully structured before analysis. Ontology matching is a common semantics-based approach that detects similarities among the ontologies of multiple sources. Even after data cleaning and error correction, some incompleteness and errors can persist in datasets and must be handled during data processing; doing this correctly is a challenge.
2) Scalability and Storage. In data analytics, the management and analysis of massive volumes of data is a challenge. Storage systems are not adequately capable of storing rapidly increasing data sets, though such problems can be ameliorated by improving processor speed. A processing system is therefore needed that will also meet the necessities of the future.
3) Security and Privacy. A much more serious issue is how to find meaningful information in large and rapidly generated datasets. Researchers have many methods and techniques to access data from any data source to discover trends, and they have often ceased worrying about an individual's security and privacy when data is shared and agglomerated across dynamic or distributed computing systems. Organizations have been using diverse de-identification approaches to maintain privacy and security.
4) Human Collaboration. Despite the enormous advancements made in computational analysis, there are still many patterns that humans can easily identify but computer algorithms have a difficult time identifying. A big data analysis framework must support input from diverse human experts and the sharing of results. These experts may be separated in space and time when it is too expensive to assemble an entire team in one room. The data system must accept this distributed expert input and support their collaboration.
Other challenges include data replication, data locality, combining multiple data sets, data quality, fault tolerance of big data applications, data availability, data processing, and data management.
C. Tools of Big Data Analytics
1) Apache Hadoop:- Apache Hadoop is one of the most prominent big data frameworks and is written in Java. Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and store it across a distributed environment. However, it can only perform batch processing.
2) MapReduce:- MapReduce is a programming model that processes and analyzes huge data sets. Google introduced it in December 2004. It is used for batch processing over persistent storage; however, MapReduce was not built for real-time processing.
3) Apache Hive:- Apache Hive provides a SQL-like query language and was established by Facebook. Hive is a data warehousing component that performs reading, writing, and managing of large datasets in a distributed environment.
4) Apache Pig:- Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop, originally developed by Yahoo in 2006. Using it, all data manipulation operations in Hadoop can be performed.
5) Apache HBase:- Apache HBase is a distributed, column-oriented database that runs on top of the HDFS file system. It is a NoSQL data store, similar to a database management system, but it provides quick random access to huge amounts of structured data.
6) Apache Storm:- Apache Storm is an open-source distributed real-time computation system, used wherever a lot of streaming data is generated. Twitter uses it for real-time data analysis.
7) Apache Cassandra:- Apache Cassandra is a free, open-source NoSQL database created by Facebook. It is popular and very robust for handling huge amounts of data.
8) Apache Spark:- Apache Spark is one of the most prominent and highly valued big data frameworks. It was developed at the University of California, Berkeley, and is written in Scala. Apache Spark is fast because of its in-memory processing. It does real-time data processing as well as batch processing on huge amounts of data; it requires a lot of memory, but it can deal with standard disk speed and capacity.
D. Application areas of Big Data Analytics
• Healthcare:- Big data analytics uses a patient's medical history data to determine how likely they are to develop health problems. It is also used in healthcare to minimize costs, predict epidemics, and prevent avoidable diseases. The Electronic Health Record is one of the most popular applications of big data in the healthcare industry.
• Banking:- Banks use big data analytics to identify fraudulent transactions. Because the analytics system stops fraud before it occurs, the bank improves its profitability.
• Media and Entertainment:- The entertainment and media industries use big data analytics to understand what content, products, and services people want.
• Telecom:- Telecom is among the largest contributors to big data. Operators use analytics to improve services and route traffic more efficiently; the analytics system is also used to examine call detail records and recognize fraudulent behavior, which helps them take action immediately.
• Government:- The Indian government has used big data analytics to assist law enforcement and to estimate trade in the country. Big data analytics gives governmental procedures gains in expenditure, productivity, and innovation.
• Education:- Nowadays the education sector is gradually adopting big data analytics. Big-data-powered technologies have improved learning tools, and analytics is also used to enhance and develop existing courses according to trade requirements.
• Retail:- Retail uses big data analytics to optimize its business, both e-commerce and in-store; for example, Amazon, Flipkart, and Walmart.
E. Overview of Apache Spark Technology
Apache Spark is an open-source distributed, in-memory
cluster computing framework designed to provide faster and
easy-to-use analytics than Hadoop MapReduce. In 2009,
AMPLab of UC Berkeley designed Apache Spark and first
released it as open-source in March 2010 and donated to the
Apache Software Foundation in June 2013. This open -source
framework protrudes for its ability to process large volumes of
data. Spark is 100 times faster than MapReduce of Hadoop
since there is no time consumed in transferring and processing
data in or out of the disk because all of these processes are
978-1-7281-7089-3/20/$31.00 ©2020 IEEE
473
Authorized licensed use limited to: Carleton University. Downloaded on June 02,2021 at 03:18:18 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]
IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3
done in memory. It supports stream processing, known as realtime processing which includes continuous input and output of
data and is suitable for trivial operations and massive data
processing on large clusters. Many organizations such as
Healthcare, Bank, Telecom, E-Commerce, etc; and all of the
major gigantic technology companies such as Apple,
Facebook, IBM, and Microsoft are used Apache Spark to
improve their business insights. These companies collect data
in terabytes from various sources and process it, to enhance
consumer's services.
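To make the "easy-to-use analytics" point concrete, a minimal PySpark DataFrame sketch is shown below (toy data and column names are illustrative; it assumes pyspark is installed). The aggregation is planned and executed by Spark in memory, with no explicit MapReduce code.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("overview-demo").getOrCreate()

    # A DataFrame expresses a MapReduce-style aggregation declaratively.
    df = spark.createDataFrame(
        [("telecom", 120.0), ("banking", 95.5), ("telecom", 88.2)],
        ["sector", "spend"])
    df.groupBy("sector").agg(F.avg("spend").alias("avg_spend")).show()

    spark.stop()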
The Apache Spark Ecosystem has various components: Spark Core and upper-level libraries such as Spark SQL, Spark Streaming, Spark MLlib, GraphX, and SparkR, which are built on top of Spark Core. Cluster managers, viz. Standalone, Hadoop YARN, Mesos, and Kubernetes, are operated through Spark Core.

Fig. 2. Apache Spark Ecosystem

a) Spark Core:- Spark Core is the main component of Apache Spark; all processes of Apache Spark are handled by Spark Core. The libraries Spark SQL, Spark Streaming, Spark MLlib, and GraphX are built on top of it. It provides RDDs (resilient distributed datasets), which allow Spark's libraries to execute in a distributed environment.
b) Spark SQL:- Spark SQL is a data processing framework in Apache Spark that is built on top of Spark Core. Both structured and semi-structured data can be accessed via Spark SQL, which can read data in different formats such as text, CSV, JSON, Avro, etc. It enables powerful, interactive, analytical applications over both streaming and historical data.
c) Spark Streaming:- The Spark Streaming component is built on top of Spark Core and is used to process real-time and live data. It also allows an application to perform batch and streaming processing of data together.
d) Spark MLlib:- MLlib is the package of Apache Spark that accommodates multiple types of machine learning algorithms (classification, regression, and clustering) on top of Spark. It processes large datasets to recognize patterns and support decisions, and its machine learning algorithms run over many iterations toward the desired objective in a scalable manner.
e) GraphX:- Apache Spark ships with a module that allows distributed computation on graph data structures, such as the networks found in social media. GraphX follows the Pregel model, which Google revealed in 2010.
f) SparkR:- SparkR is a module of Apache Spark that provides a lightweight frontend for using Spark from R. It is a distributed computation platform that allows the processing of structured data and ML tasks, since the R programming language itself was not designed to manage datasets too large to fit on a single machine.
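The sketch below touches two of the components just listed: Spark SQL querying a registered view, and the same DataFrame feeding a Spark MLlib algorithm. It is a toy example under stated assumptions (pyspark installed; the two-row dataset is purely for illustration), not a realistic pipeline.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("components-demo").getOrCreate()

    # Spark SQL: run plain SQL over structured data on top of Spark Core.
    df = spark.createDataFrame([(0.0, 1.0, 0.2), (1.0, 0.1, 0.9)],
                               ["label", "f1", "f2"])
    df.createOrReplaceTempView("samples")
    spark.sql("SELECT label, f1 + f2 AS total FROM samples").show()

    # Spark MLlib: assemble feature columns and fit a classifier on the same data.
    assembled = VectorAssembler(inputCols=["f1", "f2"],
                                outputCol="features").transform(df)
    model = LogisticRegression(maxIter=10).fit(assembled)
    print(model.coefficients)

    spark.stop()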
The cluster manager manages a cluster of computers, with the CPU, memory, storage, ports, and other resources available on the cluster's nodes. Spark supports several cluster managers, including Standalone, YARN, Mesos, and Kubernetes, each of which provides scripts that can be used to deploy a Spark application. Apache Spark can thus be operated through many cluster managers, and the following modes are currently available for the deployment of Spark.
1. Standalone:- The term standalone means that it does not need an external scheduler. The Spark standalone cluster manager provides everything needed to start Apache Spark and can get running quickly with few dependencies or environmental considerations. Standalone is a cluster management technique responsible for managing the hardware and memory on each node, and Spark applications are supported through it. It manages the Spark components and provides limited functionality. (In general computing, "standalone" also describes ordinary desktop applications such as Microsoft Word, Autodesk 3ds Max, Adobe Photoshop, or Google Chrome; here it refers to Spark's built-in cluster manager.) This cluster manager can run on Linux, Mac, or Windows.
2. Hadoop YARN:- YARN (Yet Another Resource Negotiator) is a generic open-source cluster manager that enables a Spark application to share cluster resources with Hadoop MapReduce applications. It succeeds the earlier MapReduce Job-tracker, providing features such as cluster management, job scheduling, and monitoring capabilities. YARN supports both client and cluster mode deployment of a Spark application and can run on Linux or Windows.
3. Apache Mesos:- Apache Mesos is also an open-source cluster manager, one that scales elegantly to thousands of nodes. It is a master/slave-based system and is fault-tolerant. For a cluster of machines, it can be viewed as an operating system kernel: it pools the computing resources of a cluster of machines and allows those resources to be shared among different applications. Mesos was developed to support a diversity of distributed computing applications that can share both static and dynamic cluster resources. Organizations such as Twitter, Xogito, and Media Crossing use Apache Mesos, and it can run on Linux or Mac operating systems.
4. Kubernetes:- Spark can run on clusters that are orchestrated through Kubernetes, an open-source container management platform introduced by Google in 2014 and since adopted by Spark. Kubernetes brings advantages such as feasibility and stability, so a full Spark cluster can run on it. It is a portable and cost-effective platform with self-healing abilities, developed for managing complex distributed systems through containers.
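From the application's point of view, the cluster managers above differ mainly in the master URL handed to Spark. The sketch below lists the standard master URL forms (hostnames are placeholders; for YARN and Kubernetes the master is normally passed to spark-submit rather than hard-coded):

    from pyspark.sql import SparkSession

    # Typical master URLs, one per cluster manager:
    #   local[*]                - run in-process on all local cores (no cluster manager)
    #   spark://host:7077       - Spark Standalone
    #   yarn                    - Hadoop YARN (client or cluster deploy mode)
    #   mesos://host:5050       - Apache Mesos
    #   k8s://https://host:6443 - Kubernetes
    spark = (SparkSession.builder
             .master("local[*]")      # swap in one of the URLs above when deploying
             .appName("deploy-demo")
             .getOrCreate())

    print(spark.sparkContext.master)
    spark.stop()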
II. LITERATURE REVIEW

Table 1. Literature Survey

[1] Omar, H. K., & Jumaa, A. K. (2019). Techniques: Apache Spark, MLlib library with Python and Scala. Observation: The authors presented a comparison between Java and Scala for evaluating time performance in Apache Spark MLlib. They also explain the tools, APIs, programming languages, and Spark machine learning libraries in Apache Spark, and the advantages of loading and accessing data from stored sources like Hadoop HDFS, Cassandra, and HBase. The authors concluded that the performance of Scala is much better than that of Java.

[2] Van-Dai Ta et al. (2016). Techniques: Apache Spark, Streaming API, machine learning and data mining techniques. Observation: The paper created a general architecture using the Spark streaming method that can be implemented in the healthcare system for big data analytics, and explains how efficiency can be enhanced through machine learning and data mining techniques.

[3] Keerti, Singh, K., & Dhawan, S. (2016). Techniques: Hadoop MapReduce, Apache Spark. Observation: The researchers focus on a big data application model that can be used in real-time systems, the social network area, and the healthcare system. The paper introduces MapReduce, Hadoop, and Spark, and compares Spark with MapReduce. Their three-layered model of big data performs operations such as calculating the average speed rate and code necessity with Spark; this application was built with data processing techniques and applied to the healthcare system.

[4] Ajaysinh, R. P., & Somani, H. (2016). Techniques: Apache Spark and machine learning algorithms. Observation: The authors implemented a healthcare model using different analysis and prediction techniques with machine learning algorithms for better predictions.

[5] Salwan, P., & Maan, V. K. (2020). Techniques: Apache Spark. Observation: The work focuses on an e-governance system built using Apache Spark for analyzing government-collected data. The authors briefly explain the architecture of Apache Spark, including the core layer, ecosystem layer, resource management, and the methods used in Spark. Government departments generate data of such high volume that it cannot be managed by a traditional database management system, so a more efficient system was built using big data analytics techniques, which also resolved the main issues of traditional database management systems such as speed, mixed-type datasets, and accuracy.

[6] Bhattacharya, A., & Bhatnagar, S. (2016). Techniques: Hadoop MapReduce and Spark tools. Observation: The authors explained the concepts of big data and Apache Spark, first introducing big data and its V's, a very important part of it, then big data analytics, its security issues, and the variety of tools available in the market, such as Hadoop MapReduce and Apache Spark. They also presented a comparison between Hadoop's MapReduce and Apache Spark on features such as memory and competing products.

[7] Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Techniques: Apache Spark. Observation: The paper focuses on the basic components and unique features of Apache Spark for big data analytics, covering the ML pipelines API and the distinct utilities Spark provides for designing and implementing applications, and illustrates the growing popularity of Apache Spark technology in big data analytics research.

[8] Hongyong Yu & Deshuai Wang (2012). Techniques: MapReduce, Apache Hive. Observation: The authors explained how to improve the privacy and security of data in big data analytics, discussing both the MapReduce and Apache Hive frameworks in terms of usage, ease of use, capability, and processing.

[9] Hussain et al. (2016). Techniques: Apache Spark and Hadoop MapReduce. Observation: The authors focused on a learning analytics model that can predict future trends for educational organizations. They explored the usages and methodologies (Hadoop MapReduce) of big data and identified issues such as data privacy, capacity, processing, and analysis. They concluded that the prediction model helps education authorities follow learning activities and patterns that are in the trends.

[10] Shaikh, E., Mohiuddin, I., Alufaisan, Y., & Nahvi, I. (2019). Techniques: Apache Spark. Observation: The authors discussed how in-memory computing is performed in Apache Spark and compared Spark with other tools for fast computing. They also explained its batch processing and stream processing capabilities, as well as its multithreading and concurrency capabilities.

[11] M. U. Bokhari et al. (2016). Techniques: Apache Spark, HDFS, and machine learning. Observation: The authors implemented a three-layered architecture model for storage and data analysis, using HDFS for data storage and machine learning algorithms for data analysis.

[12] U. R. Pol (2016). Techniques: Apache Spark, Hadoop MapReduce. Observation: The author gave a brief explanation of big data and its analytics using Apache Spark, explaining how Apache Spark improves on Hadoop, which is a good framework for data processing, as open-source distributed computing offering reliability and scalability in big data analytics.

[13] Sunil Kumar & Maninder Singh (2019). Techniques: Apache Hadoop. Observation: The authors discussed the impact of big data on the healthcare system and how to use the different tools available in the Hadoop ecosystem, and explored a conceptual architecture of big data analytics for healthcare systems.

[14] Aziz et al. (2018). Techniques: Apache Spark, MapReduce. Observation: The authors described how to process real-time data using Apache Spark and Hadoop tools in big data analytics, and compared Apache Spark with Hadoop for fast computing.

[15] Amol Bansod (2015). Techniques: Apache Spark, Hadoop MapReduce. Observation: The author described the advantages of the Apache Spark framework for data processing in HDFS and compared it with Hadoop MapReduce and other data processing frameworks.

[16] Shoro, A. G., & Soomro, T. R. (2015). Techniques: Apache Spark and Twitter Stream API. Observation: The authors discussed the concept of big data, its characteristics (V's), and big data analytics tools such as Hive, Pig, Zebra, HBase, Chukwa, Storm, and Spark, giving reasons for when Apache Spark technology should or should not be used. They also performed data processing on Twitter data with Apache Spark and the Twitter Stream API.

[17-23] Gupta, Y., et al. Techniques: Apache Pig. Observation: The authors analyzed various datasets, such as stock exchange data, crime rates of India, the population of India, and healthcare data, using Apache Pig. They also elaborated on the various tools and techniques used to analyze massive volumes of data, i.e., big data, in the Hadoop distributed file system on clusters of commodity hardware, and described various image processing techniques.

[24] Shrutika Dhoka & R. A. Kudale (2016). Techniques: Apache Spark. Observation: The authors developed a conceptual architecture on the Apache Spark platform to overcome the problems encountered when processing big data in the healthcare system.
In this paper, big data analytics and Apache Spark are explained from various aspects. The focus is on the characteristics (7V's) of big data, the tools and application areas for big data analytics, and the Apache Spark Ecosystem, including its components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Finally, a comparative study of the Python and Scala programming languages across various parameters in the context of Apache Spark is also presented.
III. RESEARCH GAP
The reviewed papers show that vast amounts of big data can be processed using the Python and Scala programming languages over Apache Spark. What is still needed is a direct comparison between the two languages for fast data processing in Apache Spark, defining which programming language is best suited to Apache Spark and can give better results in big data analytics.
IV. COMPARATIVE STUDY OF PYTHON AND SCALA IN SPARK

Python is an object-oriented, high-level, functional, interpreted programming language. PySpark is the Python API for Apache Spark; it works with RDDs and supports Spark's many operations from Python. Python is a very powerful and preferred programming language because of its high availability of libraries.

Scala is an object-oriented and functional programming language that runs on the JVM (Java Virtual Machine) and helps developers to be more inventive. It is a great programming language that helps write valid code without errors and is well suited to developing big data applications.

Performance is a very important factor: when Python and Scala are used with Spark, the performance of Scala is up to ten times faster than that of Python. Python is a dynamically typed language, which reduces its speed; Scala is statically typed and uses the Java Virtual Machine at runtime, which is the origin of its speed. A compiled language is faster than an interpreted one.

The learning curve of Scala is a bit steep compared to Python, which has simple syntax; Scala has enigmatic syntax and combines many operations into a single statement.

The availability of libraries is much richer in Python than in Scala, and Scala's libraries are less convenient for data scientists, so Python is the preferable language in this respect. In Python, testing and its methodologies are very complex because the language is dynamically typed, whereas testing Scala code is less complex. Python is less verbose because of its dynamic typing; Scala, as a statically typed language, can identify errors at compile time, making it a better alternative for large-scale projects.

Scala supports multithreading, so it can handle parallelism and concurrency, while Python does not support true multithreading. The Python community keeps organizing conferences and meetups and works on the code to develop the language; Python has much larger community support than Scala. Table 2 summarizes this comparison.
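The contrast in verbosity can be seen in a complete PySpark word count, shown below as a sketch (assumes pyspark is installed; the input lines are illustrative). The equivalent Scala program has the same structure but is statically typed and compiled.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-wordcount").getOrCreate()

    lines = spark.sparkContext.parallelize(
        ["scala is fast", "python is simple", "spark runs both"])

    counts = (lines.flatMap(lambda line: line.split())   # map: split into words
                   .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))     # reduce: sum per key

    print(dict(counts.collect()))
    spark.stop()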
Table 2. Comparison between Python and Scala

Sr. No | Parameter                  | Python                                | Scala
01     | Performance                | Slower                                | Faster
02     | Learning curve             | Easy                                  | Tough
03     | Machine learning libraries | Rich libraries                        | Fewer libraries
04     | Platform                   | Interpreter                           | Compiler
05     | Visualization libraries    | Rich libraries                        | Fewer libraries
06     | Type safety                | Dynamically typed                     | Statically typed
07     | Testing                    | Very complex                          | Less complex
08     | Simplicity                 | Easy to learn; writing code is simple | May be more difficult to learn than Python
09     | Ease of use                | Less verbose                          | More verbose
10     | Concurrency                | No support                            | Highly supported
11     | IDE                        | PyCharm, Jupyter                      | Eclipse, IntelliJ
12     | Spark shell                | >pyspark                              | >spark-shell
13     | Support                    | Much larger community support         | Less community support

V. DISCUSSION
The discussion concerns which programming languages are appropriate for the big data field. Many programming languages are used to solve big data problems, and for big data professionals, choosing a language is the most important part: the choice must be suitable for the analysis and manipulation the big data problem requires, so that the desired output can be achieved. Omar, H. K., & Jumaa, A. K. (2019) [1] found that the performance of the Scala programming language is better than that of Java in Spark MLlib.

This paper has presented a comparative study of the Python and Scala programming languages across several parameters for Apache Spark. The performance of Scala is faster but it is a little more difficult to learn, while Python is slower but very simple to use. Scala does not provide as many big data analytics tools and libraries as Python for machine learning and natural language processing, and there are no good visualization tools for Scala. Python's support for Spark Streaming is not as advanced and mature as Scala's, so Scala is the best option for Spark Streaming functionality. The Apache Spark framework itself is written in Scala, so with Scala big data developers can easily dig into the source code. Scala is more engineering-oriented while Python is more analytics-oriented; both languages are excellent for developing big data applications, but to exploit the full potential of Spark, Scala is more helpful.

After this exploration, the authors conclude that in big data analytics both the Python and Scala programming languages are apt for Apache Spark technology. However, the choice of language for programming in Apache Spark depends on the features that suit the needs of the project and can most effectively solve the problem, as each language has its own benefits and drawbacks. If the programmer works on smaller projects or has less experience, then Python is a good choice; if the programmer has large-scale projects that need many tools, techniques, and multiprocessing, then the Scala programming language is the better alternative. The main objective of this paper is to make it easier for the programmer to select a programming language for Apache Spark according to their project, and that objective has been achieved.
VI. CONCLUSION
This paper has explained big data analytics and Apache Spark from various aspects, focusing on the characteristics (7V's) of big data, tools and application areas for big data analytics, and the Apache Spark Ecosystem, including its components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. It also presented a comparative study of the Python and Scala programming languages, with a table comparing the two across various parameters. The comparative study concludes that both the Python and Scala programming languages are suitable for Apache Spark technology; however, the choice of language depends on the features that best suit the needs of the project, as each language has its own advantages and disadvantages. The main purpose of this paper is to make it easier for programmers to select a programming language for Apache Spark based on their project.
REFERENCES
[1] Omar, H. K., & Jumaa, A. K. (2019). Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java. Kurdistan Journal of Applied Research, 4(1), 7–14. https://doi.org/10.24017/science.2019.1.2
[2] Van-Dai Ta, Chuan-Ming Liu, & Goodwill Wandile Nkabinde (2016). Big Data Stream Computing in Healthcare Real-Time Analytics. IEEE, pp. 37–42. DOI: 10.1109/ICCCBDA.2016.7529531
[3] Keerti, Singh, K., & Dhawan, S. (2016). Future of Big Data Application & Apache Spark vs. Map Reduce. 1(6), 148–151.
[4] Ajaysinh, R. P., & Somani, H. (2016). A Survey on Machine learning assisted Big Data Analysis for Health Care Domain. 4(4), 550–554.
[5] Salwan, P., & Maan, V. K. (2020). Integrating E-Governance with Big Data Analytics using Apache Spark. International Journal of Recent Technology and Engineering, 8(6), 1609–1615. https://doi.org/10.35940/ijrte.f7820.038620
[6] Bhattacharya, A., & Bhatnagar, S. (2016). Big Data and Apache Spark: A Review. International Journal of Engineering Research & Science (IJOER), 2(5), 206–210. https://ijoer.com/Paper-May2016/IJOER-MAR-2016-9.pdf
[7] Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3–4), 145–164. https://doi.org/10.1007/s41060-016-0027-9
[8] Hongyong Yu & Deshuai Wang (2012). Research and Implementation of Massive Health Care Data Management and Analysis Based on Hadoop. IEEE, pp. 514–517. DOI: 10.1109/ICCIS.2012.225
[9] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., & Stoica, I. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56–65. https://doi.org/10.1145/2934664
[10] Shaikh, E., Mohiuddin, I., Alufaisan, Y., & Nahvi, I. (2019). Apache Spark: A Big Data Processing Engine. 2019 2nd IEEE Middle East and North Africa Communications Conference (MENACOMM 2019). https://doi.org/10.1109/MENACOMM46666.2019.8988541
[11] M. U. Bokhari, M. Zeyauddin, & M. A. Siddiqui (2016). An effective model for big data analytics. 3rd International Conference on Computing for Sustainable Global Development, pp. 3980–3982.
[12] Pol, U. R. (2016). Big Data Analysis: Comparison of Hadoop MapReduce and Apache Spark. International Journal of Engineering Science and Computing, 6(6), 6389–6391. https://doi.org/10.4010/2016.1535
[13] Kumar, S., & Singh, M. (2018). Big data analytics for the healthcare industry: impact, applications, and tools. Big Data Mining and Analytics, 2(1), 48–57. https://doi.org/10.26599/bdma.2018.9020031
[14] Aziz, K., Zaidouni, D., & Bellafkih, M. (2018). Real-time data analysis using Spark and Hadoop. 2018 4th International Conference on Optimization and Applications (ICOA). DOI: 10.1109/icoa.2018.8370593
[15] Amol Bansod (2015). Efficient Big Data Analysis with Apache Spark in HDFS. International Journal of Engineering and Advanced Technology, 6, 2249–8958.
[16] Shoro, A. G., & Soomro, T. R. (2015). Big data analysis: Apache spark perspective. Global Journal of Computer Science and Technology.
[17] Gupta, Y. K., & Sharma, S. (2019). Impact of Big Data to Analyze Stock Exchange Data Using Apache PIG. International Journal of Innovative Technology and Exploring Engineering, ISSN: 2278-3075, 8(7), pp. 1428–1433.
[18] Gupta, Y. K., & Sharma, S. (2019). Empirical Aspect to Analyze Stock Exchange Banking Data Using Apache PIG in HDFS Environment. Proceedings of the Third International Conference on Intelligent Computing and Control Systems (ICICCS 2019).
[19] Gupta, Y. K., & Gunjan, B. (2019). Analysis of Crime Rates of Different States in India Using Apache Pig in HDFS Environment. Recent Patents on Engineering, Print ISSN: 1872-2121, Online ISSN: 2212-4047, 13:1. https://doi.org/10.2174/1872212113666190227162314. Site: http://www.eurekaselect.com/node/170260/article
[20] Gupta, Y. K., & Choudhary, S. (2020). A Study of Big Data Analytics with Two Fatal Diseases Using Apache Spark Framework. International Journal of Advanced Science and Technology (IJAST), 29(5), pp. 2840–2851.
[21] Gupta, Y. K., Kamboj, S., & Kumar, A. (2020). Proportional Exploration of Stock Exchange Data Corresponding to Various Sectors Using Apache Pig. International Journal of Advanced Science and Technology (IJAST), 29(5), pp. 2858–2867.
[22] Gupta, Y. K., & Mittal, T. (2020). Empirical Aspects to Analyze Population of India using Apache Pig in Evolutionary of Big Data Environment. International Journal of Scientific & Technology Research (IJSTR), ISSN 2277-8616, 9(1), pp. 238–242.
[23] Gupta, Y. K., & Jha, C. K. (2016). A Review on the Study of Big Data with Comparison of Various Storage and Computing Tools and their Relative Capabilities. International Journal of Innovation in Engineering & Technology (IJIET), ISSN: 2319-1058, 7(1), pp. 470–477.
[24] Dhoka, S., & Kudale, R. A. (2016). Use of Big Data in Healthcare with Spark. Proceedings - International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pp. 172–176. https://doi.org/10.1109/PAAP.2015.4