International Journal of Engineering Trends and Technology (IJETT) - Volume 35 Number 4 - May 2016
A Survey on Issues and Challenges in Handling Big Data
Sandeep K N
Dept of Information Science and Engineering,
JSS Academy of Technical Education, Bangalore, India.
knsandeep7@gmail.com

Usha R G
Dept of Information Science and Engineering,
JSS Academy of Technical Education, Bangalore, India.
usha.r.g1218@gmail.com
Abstract: Data is an essential characteristic of today's technology, and as data volumes range from gigabytes to terabytes, petabytes and exabytes, this large pool of data can be brought together and analyzed using "Big Data". Big Data is a collection of large data sets generated by every phone, website and application across the Internet. Due to the huge volume and the speed at which it is generated, it is very difficult for a machine to maintain and process Big Data; hence Hadoop is used to manage it. The technologies used for Big Data include Hadoop, MapReduce, Hive, NoSQL databases, etc. This paper covers the features, functionalities and challenges of Big Data, Hadoop, HDFS and MapReduce.

Keywords: Big Data, Hadoop, HDFS, MapReduce.
1. INTRODUCTION
In ancient times, when there was no technology, people used their own methods to store data, writing on wood with charcoal and carving on stone. As time passed, man used primitive ways of storing data on paper and cloth. Later, new inventions and discoveries made it possible to store data in vacuum tubes, magnetic tapes, floppy disks, CD-ROMs, hard disks, pen drives, memory cards, Blu-ray discs, etc. Following this trend, technology has made a drastic change by using Big Data to accumulate huge amounts of data [3].

With the immense growth of technological development, production and services, a large amount of data is generated from different sources in different domains, and it can be structured, semi-structured or unstructured. In their daily routines, people store large amounts of data on Facebook, Twitter, Google Drive, mail, YouTube, etc. These companies have to provide storage for huge amounts of data, and due to this massive use of storage, the need for Big Data came into existence.
Big Data is a collection of large data sets that is continuously being generated. In 1990, people were usually using hard disks of 1 GB to 20 GB capacity. The "size" of Big Data is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes of data. Big Data requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex and of massive scale. In future years we may arrive at a situation where we need thousands of zettabytes of disk space to store the data. Due to this increase in data storage, we need Big Data. The need for Big Data arose from large companies like Facebook, Yahoo, YouTube, Google, etc., for the purpose of analyzing enormous amounts of data in unstructured or even structured form.
Figure 1: Big Data [1]
New skills are needed to fully harness the power of big
data. Though courses are being offered to prepare a
new generation of big data experts, it will take some
time to get them into the workforce. Leading
organizations are developing new roles, focusing on
key challenges and creating new business models to
gain the most from big data [4]. Big data includes data produced by many different devices. The different sources of Big Data are given below [5]:

• Black Box Data – A component used in airplanes, jets, helicopters, etc.; it records the voices of the flight crew.
• Social Media Data – Social media such as WhatsApp, Facebook and Twitter store the data and views posted by people all around the globe.
• Stock Exchange Data – The stock exchange holds information about the buy and sell decisions made by various companies.
• Power Grid Data – The power grid data holds the information consumed by a particular node with respect to a base station.
• Transport Data – It stores information about the model, capacity, distance and availability of vehicles.
• Search Engine Data – Search engines retrieve a lot of data from various databases.
There are various technologies in the market from
different vendors including Amazon, IBM, Microsoft,
etc., to handle big data.
2. CHARACTERISTICS OF BIG DATA
The seven V's of Big Data are:

Volume – With the advancement of technology, the data that is generated and collected is increasing rapidly. If the volume is in gigabytes, it is probably not Big Data, but at the terabyte and petabyte scale and beyond it may very well be. Volume is a key contributor to why traditional relational database management systems (RDBMS) fail to handle Big Data. The volume determines the actual quantity of data.

Velocity – Velocity refers to the increasing speed at which data is created, and the increasing speed at which it can be processed, stored and analyzed. It describes both data-at-rest and data-in-motion: sending data and fetching data both require some velocity. Velocity in big data refers to how fast the data is generated, and it also incorporates the characteristic of timeliness or latency: is the data being captured at a rate, or with a lag time, that makes it useful?

Variety – Variety refers to the different types of data generated and how the data is stored. The data can be structured, semi-structured or unstructured. Legal records and data in an RDBMS belong to structured data; blogs, log files and emails are good examples of semi-structured data; unstructured data is stored in the form of audio, video, images, text and graphs, as well as the output of all types of machine-generated data from sensors, devices, cell phone GPS signals, DNA analysis devices and so on.

Variability – As data changes from time to time, it causes inconsistency. This is particularly the case when gathering data relies on language processing, and it makes the data difficult to manage and handle efficiently.

Veracity – Veracity refers to the biases, noise and abnormality in the data, i.e., whether the data being stored and mined is meaningful to the problem being analyzed. Veracity is the biggest challenge in data analysis when compared to volume and velocity. The quality of data varies greatly from one data set to another, and the precision of the data analysis depends on the veracity of the source data.

Visualization – Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Data visualizations are everywhere and more important than ever: from creating a visual representation of data points as part of an executive presentation, to showcasing progress, or even visualizing concepts for customer segments, data visualizations are a critical and valuable tool in many different situations.

Value – Value starts and ends with the business use case. The business must define the analytic application of the data and its potential associated value to the business. The potential value of big data is huge; the value lies in rigorous analysis of accurate data and in the information and insights this provides.
Figure 2: Characteristics of Big Data
Big Data cannot be stored on a single machine; it is normally stored across multiple machines. Internally, there must be a structure so that the multiple machines can combine their data and provide it to the end user.
3. ISSUES OF BIG DATA

• Data access and connectivity can be a hindrance.
• Processing time increases as the data size increases; hence immediate retrieval of important information may be impossible.
• Incomplete data also creates uncertainties, and correcting such data is difficult. Incomplete refers to missing data, and some algorithms are used to overcome it.
• Storing and managing huge amounts of data is quite difficult, and retrieval is also a major challenge.
• Difficulties arise from the heterogeneous mixture of data, because the data formats and patterns vary greatly. Data can be in structured, semi-structured or unstructured form, and converting unstructured data to a structured format is a major challenge.
4. CHALLENGES OF BIG DATA

• In today's business environment, along with storing and finding the relevant data, access must also be quick. As huge amounts of data are stored, access speed may decrease; hence reliable hardware must be used.
• Even if we can find and analyze the data quickly, the major challenge is to have accurate and valuable data; hence data quality must be assured.
• Understanding the data takes a lot of time; hence we need people with domain expertise and a good understanding of the data.
• Identifying the data collected and implementing the right solution to accurately analyze the data.
• Security threats to big data environments, or to the data stored within a cluster, must be addressed.
5. HADOOP AND HDFS

Since we store huge amounts of data, the processing time should be reduced in order to achieve efficiency. The best solution for this is Hadoop, whose founder is Doug Cutting. Hadoop is an open-source, Java-based programming framework from the Apache Software Foundation for storing and processing huge data sets on clusters of commodity hardware. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

The Hadoop framework includes the following four modules [6]:

1. Hadoop Common
2. Hadoop YARN
3. Hadoop Distributed File System (HDFS)
4. Hadoop MapReduce
Hadoop is designed to store huge data sets and is not recommended for small data sets. Hadoop has five services:

1. Name node
2. Secondary name node
3. Job tracker
4. Data node
5. Task tracker

The first three services are called Master services or Master nodes, and the last two are called Slave services or Slave nodes. Every master service can talk to the other master services, and every slave service can talk to the other slave services. If the name node is a master service then the data node is its corresponding slave service, and if the job tracker is the master service then the task tracker is its corresponding slave service.
Hadoop Distributed File System (HDFS) – HDFS is a technique for storing huge amounts of data with a streaming access pattern on a cluster of commodity hardware. The streaming access pattern means that data is written once and read any number of times. HDFS has a default block size of 64 MB, and the block size can also be increased. In a normal operating system, if we store 2 KB of data in a 4 KB block, the remaining space is wasted; in HDFS, the space is not wasted.
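To make the streaming access pattern concrete, the short sketch below uses Hadoop's Java FileSystem API to write a file to HDFS once and then read it back. This is only an illustrative sketch: the name node URI, the file path and the class name are assumptions, not taken from this paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative name node address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");

        // Write once: HDFS files follow a streaming, write-once access pattern.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Read any number of times.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}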
Figure 3: Master-Slave Architecture of Hadoop [2]
The client machine uses the name node service to store huge amounts of data. The name node maintains the metadata that keeps track of all the information about storage, while the data itself is stored in the data nodes. HDFS provides backup by storing multiple copies of the data in case of data loss. The name node is called a single point of failure because if the name node is lost, nothing can be accessed.

If a program needs to access the data stored in a data node, the job tracker requests the name node for access to the data. The name node responds by giving the metadata to the job tracker, and the job tracker assigns a task to the task tracker. The task tracker chooses the nearest system (i.e., the one which is nearest among the three replications/copies) and processes it; this process is called map. The pieces into which files are divided for storage in HDFS are called input splits, and the number of input splits is equal to the number of mappers.
6. MAP REDUCE
MapReduce is a programming model for processing large-scale data sets on computer clusters, and it is a core component of the Apache Hadoop software framework. A MapReduce operation includes:

• Specifying the computation in terms of a map and a reduce function.
• Parallel computation across large-scale clusters of machines.
• Handling machine failures and performance issues.
• Ensuring efficient communication between the nodes.
The main reason to perform mapping and reducing is to speed up the execution of a specific process by splitting the process into a number of tasks, thus enabling parallel work. The MapReduce programming model consists of two functions: a Map() method that performs filtering and sorting, and a Reduce() method that performs a summary operation. Hadoop runs MapReduce in the form of (key, value) pairs. A MapReduce cluster employs a master-slave architecture, and the model is used to reduce network communication cost; optimizing the communication cost is essential to a good MapReduce algorithm. The following are the MapReduce components:
1. Name Node – It manages the HDFS metadata.
2. Data Node – It stores HDFS blocks; the default replication level for each block is 3.
3. Job Tracker – It manages jobs and resources in a cluster.
4. Task Tracker – It runs MapReduce operations.
Figure 4: Components of MapReduce

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Word count example of MapReduce:

Figure 5: MapReduce word count [7]
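To make the word count example of Figure 5 concrete, the sketch below shows the classic word count job written against Hadoop's Java MapReduce API, following the structure of the standard Apache Hadoop tutorial; the class names and the input/output paths passed on the command line are illustrative assumptions, not taken from this paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts received for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the mapper and reducer into a job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Here the mapper emits an intermediate (word, 1) pair for every token, and the reducer receives each word together with the list of its counts and writes out the total, exactly as sketched in Figure 5.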
Map execution consists of the following phases:

Figure 6: Execution flow of data

Map phase:
• Reads the input splits from HDFS.
• Parses the input into (key, value) records.
• Applies the map function to each record.
• Informs the master node of its completion.

Partition phase:
• Each mapper must determine which reducer will receive each of its outputs.
• For any given key, the destination partition is the same (a sketch of such a partitioner follows this phase list).
• Number of partitions = Number of reducers.

Shuffle phase:
• Fetches the input data from all map tasks for the portion corresponding to the reduce task.

Sort phase:
• Merge-sorts all map outputs into a single run.

Reduce phase:
• Applies the user-defined reduce function to the merged run.
• Arguments: the key and the corresponding list of values.
• Writes the output to a file in HDFS.
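As referenced in the partition phase above, the reducer that receives a given key is typically chosen by hashing the key. The sketch below mirrors the behaviour of Hadoop's default HashPartitioner; the class name is an illustrative assumption, not something defined in this paper.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// For a given key the result is always the same, so every intermediate
// (word, count) pair for one word reaches the same reducer, and the number
// of partitions equals the number of reduce tasks configured for the job.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A custom partitioner of this kind would be attached to a job with job.setPartitionerClass(WordPartitioner.class); if none is set, Hadoop applies the same hash-based rule by default.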
7. CONCLUSION
Today, big data is no longer an experimental tool. Since data is growing exponentially all over the world, big data is becoming a new area for research and business applications. The analysis of big data helps business people make better decisions, and many companies have begun to achieve results with this approach. Big data technologies like Hadoop and MapReduce provide many advantages.
REFERENCES
[1] http://www.isymmetry.com
[2] http://www.rosebt.com/blog/hadooparchitecture-and-deployment
[3] Sudha P. R., "A Survey on MapReduce, Hadoop and YARN in Handling Big Data", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 6, Issue 1, January 2016, ISSN: 2277-128X.
[4] http://www.ibm.com/big-data/us/en/
[5] http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm
[6] http://www.tutorialspoint.com/hadoop/hadoop_introduction.htm
[7] http://javax4u.blogspot.in/2012/11/hadoop.html
GUIDED BY
Sudha P. R
Assistant Professor, Department of ISE,
JSS Academy of Technical Education, Bangalore, India.

AUTHORS' PROFILE
SANDEEP K N
USHA R G