BIG DATA AND CLOUD SYSTEMS
Bhavika Arigala, Kshitij Dhar, Aishwary, M S Vishvesh
Department of Computer Science, Sir M Visvesvaraya Institute of Technology
Abstract- The term big data came to light amid the enormous increase of global data, as a technology able to store, manage, and analyze large datasets, providing both enterprises and science with deep understanding of their clients and experiments. Big data includes structured as well as unstructured data. Cloud computing is one of the most powerful technologies for performing massive-scale and complex computing. It eliminates the need to maintain expensive computing hardware, dedicated space, and related software. Within this paper we present an overview, related work, and challenges of both the concepts mentioned above.
Keywords: Hadoop, MapReduce, HDFS, Big Data, Cloud computing.
1. INTRODUCTION
Big Data Analytics (BDA) is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases offers greater statistical power, but data sets with numerous rows and columns can make the process cumbersome and may lead to a higher false discovery rate. Current usage of the term big data tends to refer to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. Big data is associated with five key properties, also known as the five V's of big data: volume, variety, velocity, value, and veracity.
1. Volume covers the many factors that contribute to the increase in the amount of data, such as IoT.
2. Variety concerns the different types of data from various sources that big data frameworks have to deal with.
3. Velocity concerns the different rates at which data streams may get in or out of the system, and provides an abstraction layer so that big data systems can store data independently of the incoming or outgoing rate.
4. Value concerns the true value of data (i.e., the potential value of the data regarding the information it contains). Huge amounts of data are worthless unless they provide value.
5. Veracity refers to the dependability of the data, addressing data confidentiality, integrity, and availability. Organizations need to ensure that the data, as well as the analyses performed on it, are correct.
Cloud computing is another paradigm, one which promises theoretically unlimited on-demand services to its users. Clouds may be limited to a single organization or be available to many organizations. Large clouds, predominant today, often have functions distributed over multiple locations from central servers. If the connection to the user is relatively close, it may be designated an edge server.
A number of architectures and deployment
models exist for cloud computing, and these
architectures and models are able to be used
with other technologies and design
approaches. Owners of small to medium
sized businesses who are unable to afford
adoption of clustered NAS technology can
consider a number of cloud computing
models to meet their big data needs. Small to medium sized business owners need to choose the correct cloud computing model in order to remain both competitive and profitable.
Cloud providers usually offer three basic services: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
Since the cloud virtualizes resources in an on-demand fashion, it is the most suitable and compliant framework for big data processing; through hardware virtualization it creates a high-processing-power environment for big data.
2. RELATED WORK
There are several software products and
technologies available to facilitate big data
analytics, with the help of analytical
techniques. These tools help in big data
processing and resource management. We
have described the most common ones in this
paper.
2.1 Hadoop
Hadoop is a free, Java-based programming framework that manages data processing and storage for big data applications in a distributed computing environment. It is developed as part of an open-source project within the Apache Software Foundation, and is formally known as Apache Hadoop.
Hadoop's ability to process and store different types of data makes it a particularly good fit for big data environments. [1] Hadoop systems can handle various forms of data, thereby providing more flexibility than relational databases and data warehouses. A Hadoop cluster uses a master/slave structure. Using Hadoop, large datasets can be processed over a cluster of servers, and applications can be run on systems with thousands of nodes involving thousands of terabytes of information, as shown in Fig 1. [2]
Fig 1. Hadoop Framework Structure
The distributed file system in Hadoop enables rapid data transfer rates and allows the system to continue normal operation even in the case of node failures. This approach lowers the risk of an entire system failure, enabling a computing solution that is scalable, cost-effective, flexible, and fault tolerant.
The Hadoop framework is used by giants such as IBM, Google, Yahoo, and Amazon to support their applications involving huge amounts of data. Some institutions manage their analytical data in an enterprise data warehouse (EDW) on their own, while others use a different platform that helps relieve some of the burden on the server that results from managing data on the EDW. [3] Servers can be added to or removed from the cluster dynamically without causing any interruption to operations. Hadoop has two main subprojects - MapReduce and the Hadoop Distributed File System (HDFS).
2.2 MapReduce
MapReduce is a programming model suited to processing huge amounts of data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, and thus very useful for performing large-scale data analysis using multiple machines in the cluster. [4]
In simple terms, it is a programming model
and an associated implementation for
processing and generating large data sets
with a parallel, distributed algorithm on a
cluster. A MapReduce program is composed
of a map procedure (or method), which
performs filtering and sorting (such as sorting
students by first name into queues, one queue
for each name), and a reduce method, which
performs a summary operation (such as
counting the number of students in each
queue, yielding name frequencies). [5] This
framework is reliable and fault tolerant. It is
a processing technique built on the divide and
conquer algorithm.
Initially, the application is divided into individual chunks which are processed by individual map jobs in parallel. The results of the maps, sorted by the framework, are then sent as input to the reduce tasks. Consider the following input to our MapReduce program:
Welcome to Hadoop Class | Hadoop is good | Hadoop is bad
Fig 2. MapReduce Architecture using an example
The mapping step takes a set of data in order
to convert it into another set of data by
breaking the individual elements into
key/value pairs called tuples. The second
step of reducing takes the output derived
from the mapping process and combines the
data tuples into a smaller set of tuples. [5]
Scheduling, monitoring, and re-executing failed tasks are taken care of by the framework. [6]
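To make the map and reduce steps concrete, the following is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class names and the command-line input/output paths are illustrative placeholders, not part of the original paper. Applied to the sample input above, it would emit tuples such as (Hadoop, 3) and (is, 2).

// Minimal word-count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: break each input line into (word, 1) key/value tuples.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // e.g. ("Hadoop", 1)
      }
    }
  }

  // Reduce step: sum all the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // e.g. ("Hadoop", 3)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the combiner here simply reuses the reducer to pre-aggregate counts on each mapper node, a common optimization that reduces the amount of data shuffled to the reduce tasks.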
2.3 Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. [7] It spans all the nodes in a Hadoop cluster for data storage.
HDFS Master (NameNode) - The NameNode regulates file access for clients. It maintains and manages the slave nodes and assigns tasks to them. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.
HDFS Slave (DataNode) - There are n slaves (where n can be up to 1000), or DataNodes, in the Hadoop Distributed File System, which manage the storage of data. These slave nodes are the actual workers which carry out the tasks and serve read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the NameNode. [8]
Fig 3. HDFS Architecture
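As a rough sketch of how a client interacts with this architecture, the following Java fragment writes and reads a file through Hadoop's org.apache.hadoop.fs.FileSystem API; the NameNode address and file path are hypothetical placeholders. The client contacts the NameNode only for metadata, while the file's blocks are streamed directly to and from the DataNodes.

// Minimal HDFS client sketch using the org.apache.hadoop.fs API.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; metadata requests go here, block
    // data is streamed directly to/from the DataNodes.
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write: the NameNode allocates blocks; DataNodes store the replicas.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read back from whichever DataNodes hold the blocks.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}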
3. ADVANTAGES
Big data analytics brings the ability to make faster, more informed business decisions, backed up by facts; a deeper understanding of customer requirements which, in turn, builds better business relationships; and increased awareness of risk, enabling the implementation of preventative measures. Here are a few more benefits: [9]
● Knowing about errors instantly within the organisation.
● Implementing new strategies.
● Improving services dramatically.
● Detecting fraud the moment it happens.
● Cost savings.
● Better sales insights.
● Keeping up with customer trends.
● Using big data increases your efficiency.
● Using big data improves your pricing.
● You can compete with big businesses.
● It allows you to focus on local preferences.
● Using big data helps you increase sales and loyalty.
4. RISKS
4.1 Legal compliance
Where the data contains personal data, a
business needs to comply with data
protection laws and the upcoming
requirements of the General Data Protection
Regulation. Personal data is any data which
can identify an individual (e.g. name,
location data, IP address). A failure to
comply with data protection legislation could
result in a serious data breach which can
attract large fines and lead to reputational
damage. [10]
The regulatory burden and the risk of breaching data protection laws can be mitigated by considering whether all of the data is necessary, whether a time limit for retention could be imposed, or whether the personal data can be anonymised. Where personal data is used, businesses need appropriate practices and policies in place.
4.2 Accuracy
Big Data Analytics is predictive in nature, which sometimes means that it draws inaccurate conclusions. If the information inputted is biased, the results are also likely to be biased. For example, the City of Boston provides a Street Bump app for smartphones. On a car journey, the app uses the phone's accelerometer and GPS data to record movements caused by problems with the roads (e.g. potholes) and transmits the data to the council. Differing levels of smartphone ownership among socioeconomic groups mean more data may be collected from more affluent areas, rather than from those with the worst roads.
4.3 Intellectual Property
Third parties may have rights over the databases, software, or algorithms used to analyse data sets. Businesses must make sure they have adequate licences so as not to infringe another party's intellectual property rights.
4.4 Cyber security
If data is valuable to one business, it is likely
to be valuable to others. Businesses must
make sure they have adequate security
measures in place to protect the data they
own and collect. Where personal data is
involved, safeguards may need to be higher.
4.5 Competition law
As data is a valuable commodity, competition authorities may prioritise investigating how companies use big data and its associated analytics. For example, the German competition authority found that Facebook abused a dominant market position through the use of targeted advertising on consumers.
Other risks to consider:
● Unorganized data
● Data storage and retention
● Cost management
● Incompetent analytics
● Data privacy
5. CHALLENGES
5.1 Building a private cloud
Although building a private cloud isn't a top priority for many organizations, for those who are likely to implement such a solution it quickly becomes one of the main challenges facing cloud computing - private solutions should be carefully addressed. Creating an internal or private cloud brings a significant benefit: having all the data in-house. But IT managers and departments will need to build and glue it all together by themselves, which can make this one of the most difficult parts of moving to cloud computing. It is also important to keep in mind the steps needed to ensure the smooth operation of the cloud:
● Automating as many manual tasks as possible (which would require an inventory management system)
● Orchestrating tasks to ensure that each of them is executed in the right order.
5.2 Managing multiple clouds
The challenges facing cloud computing aren't concentrated in one, single cloud. Multi-cloud adoption has grown exponentially in recent years, as companies shift to or combine public and private clouds and, as mentioned earlier, tech giants like Alibaba and Amazon lead the way.
In the referred survey, 81 percent of enterprises have a multi-cloud strategy. Enterprises with a hybrid strategy (combining public and private clouds) fell from 58 percent in 2017 to 51 percent in 2018, while organizations with a strategy of multiple public clouds or multiple private clouds grew slightly.
5.3 Dealing with data growth
The most obvious challenge associated with
big data is simply storing and analyzing all
that information. In its Digital Universe
report, IDC estimates that the amount of
information stored in the world's IT systems
is doubling about every two years. By 2020,
the total amount will be enough to fill a stack
of tablets that reaches from the earth to the
moon 6.6 times. And enterprises have
responsibility or liability for about 85
percent of that information.
Much of that data is unstructured, meaning
that it doesn't reside in a database.
Documents, photos, audio, videos and other
unstructured data can be difficult to search
and analyze.
When it comes to storage, converged and hyperconverged infrastructure and software-defined storage can make it easier for companies to scale their hardware. And technologies like compression, deduplication, and tiering can reduce the amount of space and the costs associated with big data storage.
On the management and analysis side,
enterprises are using tools like NoSQL
databases, Hadoop, Spark, big data analytics
software, business intelligence applications,
artificial intelligence and machine learning to
help them comb through their big data stores
to find the insights their companies need.
5.4 Generating insights in a timely manner
Organizations don't just want to store their
big data, they want to use that big data to
achieve business goals. According to the
NewVantage Partners survey, the most
common goals associated with big data
projects included the following:
1. Decreasing expenses through operational cost efficiencies
2. Establishing a data-driven culture
3. Creating new avenues for innovation and disruption
4. Accelerating the speed with which new capabilities and services are deployed
5. Launching new product and service offerings
All of those goals can help organizations become more competitive. "Everyone wants decision-making to be faster, especially in banking, insurance, and healthcare."
To achieve that speed, some organizations are looking to a new generation of ETL and analytics tools that dramatically reduce the time it takes to generate reports. They are investing in software with real-time analytics capabilities that allows them to respond to developments in the marketplace immediately.
5.5 Recruiting and retaining big data talent
But in order to develop, manage and run
those applications that generate insights,
organizations need professionals with big
data skills. That has driven up demand for
big data experts — and big data salaries
have increased dramatically as a result.
The 2017 Robert Half Technology Salary Guide reported that big data engineers were earning between $135,000 and $196,000 on average, while data scientist salaries ranged from $116,000 to $163,500. Even business intelligence analysts were very well paid, making $118,000 to $138,750 per year.
In order to deal with talent shortages,
organizations have a couple of options. First,
many are increasing their budgets and their
recruitment and retention efforts. Second,
they are offering more training opportunities
to their current staff members in an attempt to
develop the talent they need from within.
Third, many organizations are looking to
technology. They are buying analytics
solutions with self-service and/or machine
learning capabilities.
5.6 Integrating disparate data sources
The variety associated with big data leads to
challenges in data integration. Big data
comes from a lot of different places —
enterprise applications, social media
streams, email systems, employee-created
documents, etc. Combining all that data and
reconciling it so that it can be used to create
reports can be incredibly difficult. Vendors
offer a variety of ETL and data integration
tools designed to make the process easier,
but many enterprises say that they have not
solved the data integration problem yet.
In response, many enterprises are turning to
new technology solutions. In the IDG report,
89 percent of those surveyed said that their
companies planned to invest in new big data
tools in the next 12 to 18 months. When asked
which kind of tools they were planning to
purchase, integration technology was second
on the list, behind data analytics software.
5.7 Validating data
Organizations are getting similar pieces of
data from different systems, and the data in
those different systems doesn't always agree.
For example, the ecommerce system may
show daily sales at a certain level while the
enterprise resource planning (ERP) system
has a slightly different number. Or a
hospital's electronic health record (EHR)
system may have one address for a patient,
while a partner pharmacy has a different
address on record.
The process of getting those records to agree,
as well as making sure the records are
accurate, usable and secure, is called data
governance. And in the AtScale 2016 Big
Data Maturity Survey, the fastest-growing
area of concern cited by respondents was
data governance. [11]
Solving data governance challenges is very complex and usually requires a combination of policy changes and technology. Organizations often set up a group of people to oversee data governance and to write a set of policies and procedures. They may also invest in data management solutions designed to simplify data governance and help ensure the accuracy of big data stores - and of the insights derived from them.
5.8 Securing big data
Security is also a big concern for
organizations with big data stores. After all,
some big data stores can be attractive targets
for hackers or advanced persistent threats
(APTs).
However, most organizations seem to believe
that their existing data security methods are
sufficient for their big data needs as well. In
the IDG survey, less than half of those
surveyed (39 percent) said that they were
using additional security measures for their
big data repositories or analyses. Among
those who do use additional measures, the
most popular include identity and access
control (59 percent), data encryption (52
percent) and data segregation (42 percent).
5.9 Organizational resistance
It is not only the technological aspects of big data that can be challenging - people can be an issue too.
In the NewVantage Partners survey, 85.5 percent of those surveyed said that their firms were committed to creating a data-driven culture, but only 37.1 percent said they had been successful with those efforts. When asked about the impediments to that culture shift, respondents pointed to three big obstacles within their organizations:
● Insufficient organizational alignment (4.6 percent)
● Lack of middle management adoption and understanding (41.0 percent)
● Business resistance or lack of understanding (41.0 percent)
6. FUTURE WORK
With the next generation of big data technology, it should be possible to manage both traditional and new data sets together on a single cloud platform. This allows us to use the data storage - the object store native to the cloud infrastructure - and the compute capabilities of the cloud infrastructure, out of the box. No more setting up and managing Hadoop clusters, no more provisioning hardware. It is a paradigm shift in how we think about data management because now the cloud is the data platform. It also enables any user to work with any kind of data quickly, securely, and efficiently, in a way that fits immediate business needs.
7. CONCLUSION
There is a generational shift now in how
we’re thinking about big data processing.
The big data platform of the future is highly
performant, scalable, elastic—and in the
cloud. We really don’t need to stand up and
maintain our own big data infrastructure
anymore, since all the capabilities we need
are available in the cloud today.
Many organizations have benefited through
implementation of cloud in their business.
The cloud is a cost effective solution which
caters to many needs of a company and is
being adopted by enterprises in every
industry. Cloud computing makes it easy for
IT heads to store and analyze their data with
the help of optimum computing resources. It
has significant advantages over a traditional
system; however, in some instances, cloud platforms have to be integrated with traditional architectures.
Decision-makers are sometimes faced with a dilemma over whether the cloud is really the solution for their big data project. Big data exhibits unpredictable and immense computing power and storage needs.
Traditional systems have proved to be slower, since storing and managing data with them is a time-consuming and tedious process, so it is vital to look for a solution that addresses this. Since its adoption by many organizations, the cloud has been providing all the resources needed to run multiple virtual servers and cloud databases seamlessly within a matter of minutes. The compute power and ability to store data which the cloud provides result in greater agility in business functions. Huge amounts of data can be processed within a minimal time frame thanks to flexible resource allocation.
It is not easy to switch to a different technology, because there are many factors which need to be taken into consideration. Organizations have a certain budget when they wish to switch to a newer technology, and in this case the cloud is a blessing: an avant-garde technology within budget. Companies are free to choose the services they need according to their business and budget requirements. The applications and resources needed to manage big data do not cost much and can be easily implemented by enterprises. We pay only for the amount of storage space we use, and no additional charges are incurred.
When business demands are high, traditional solutions require extra physical servers in the cluster to provide the necessary processing power and storage space, but the virtual nature of the cloud allows allocation of resources on demand for the smooth functioning of the business. Scaling is a great option for obtaining the desired processing power and storage space whenever required. Big data requires a high-performance data processing platform for analytics, and there can be variations in demand which only the cloud environment can satisfy. The right management, monitoring, and tools should be available to analyze big data on the cloud. There are various big data analytics platforms, such as Apache Hadoop, a Java-based programming framework which processes structured and unstructured data.
Big data and cloud computing have truly changed the way organizations process their data and use it in their business. These technologies have impacted businesses positively, because every decision made through the analysis of big data contributes to the success of a business. The future is bright, as we can expect more growth for cloud computing and big data analytics.
References
[1] https://searchdatamanagement.techtarget.com/definition/Hadoop
[2] https://beyondcorner.com/learn-apache-hadoop/hadoop-ecosystem-architecture-components/
[3] https://www.researchgate.net/publication/325011725_Review_Paper_on_Big_Data_Analytics_in_Cloud_Computing
[4] https://www.guru99.com/introduction-to-mapreduce.html
[5] https://en.wikipedia.org/wiki/MapReduce
[6] https://www.researchgate.net/publication/316051568_Big_Data_and_Cloud_Computing_Trends_and_Challenges
[7] https://searchdatamanagement.techtarget.com/definition/Hadoop-Distributed-File-System-HDFS
[8] https://data-flair.training/blogs/hadoop-hdfs-tutorial/
[9] https://blogs.oracle.com/bigdata/big-data-future-is-cloud
[10] https://yourstory.com/mystory/3ddbbf1fb6-big-data-and-cloud-com
[11] https://www.datapine.com/blog/cloud-computing-risks-and-challenges