
Mining Big Data
Akshat Shah, Vidhi Shah, Sindhu Nair (Faculty)
Department of Computer Engg, Dwarkadas J. Sanghvi COE, Mumbai, India
akshatshah1710@gmail.com, vidhi240194@gmail.com, Sindhu.Nair@djsce.ac.in
ABSTRACT
Collections of large and complex datasets are generally called Big Data. Such large datasets cannot be stored, managed or analyzed easily by current tools and methodologies because of their size and complexity. However, such datasets provide various opportunities, like modelling and predicting the future. This overwhelming growth of data is coupled with various new challenges, and the increase of data at a massive rate has created some of the most exciting opportunities for researchers in the upcoming years. In this paper, we discuss the topic in detail: its current scenario, its characteristics and the challenges involved in forecasting the future. The paper also discusses the tools and technologies used to manage these large datasets, as well as essential issues such as security management and privacy of data.
Keywords
Big Data, Data mining, datasets, HACE theorem, Hadoop, MapReduce, 3V's
1. INTRODUCTION
The technological advancement of recent years has resulted in a meteoric increase in the data collected from various sensors and devices, in different formats, from independent or connected applications [1][2][3]. This dramatic increase in datasets has surpassed our capability to process, analyze, store and understand them [2]. According to Google, the number of indexed web pages was 1 million in 1998, but it reached 1 billion in 2000 and had already exceeded 1 trillion in 2008. The rise of social networking sites such as Facebook and Twitter allowed users to freely create content and drastically increased the Web volume. With the advent of Android and other mobile operating systems, mobile phones became a centerpiece of data collection: they allow collection of real-time data about people from many aspects and have increased CDR (call data record) based processing. The Internet of Things (IoT) can also be considered an important factor that has led to
an unprecedented increase in data [1][2][3]. Real-time analytics is required in order to manage this rapidly increasing data from networking applications, emails, blogs, tweets, posts and other sources [4]. This growth also increases security risks and affects the privacy of sensitive information and the integrity of analytics results, which enterprises use to understand general trends and problems and to predict the future. These data occur in various forms such as text, images and video, and hence establishing the reliability, trustworthiness and completeness of data from different sources becomes very difficult [5]. This worsens the problem of ensuring the overall quality of the data. It also generates security and privacy risks, as such data can be used to reveal very sensitive and confidential information [5]. The advancement of network technologies, mobile computing and cloud computing poses a great threat when unfiltered data is processed in motion. The increasing volume, velocity and variety of data present an unprecedented level of security and privacy challenges with respect to Big Data [5]. This has increased the number of threats appearing in short periods of time and has resulted in much hacking and cybercrime over the years. With these, the number of hacker tools has also increased, which, with the help of big data analytics tools, will acquire computing resources and create security setbacks that never existed before [5].
Nowadays we are witnessing an ever-increasing growth of data, which poses a serious threat to the security and privacy of that data. It is necessary to safeguard the data from such threats, since privacy may be compromised by the rise of new technologies [3]. We face various challenges in leveraging this vast amount of data, including (1) system capabilities, (2) algorithmic design, (3) business models, (4) security and (5) privacy. As an example, the impact Big Data is having on the data mining community is considered one of the most exciting opportunities in the years to come [3]. Moreover, secure management of Big Data under today's threat spectrum is also a very challenging problem. This paper argues that significant research effort is needed, as future work, to build a generic architectural framework that addresses these security and privacy challenges in a holistic manner.
2. MINING BIG DATA
Big Data is a term used to identify datasets that, due to their large size and complexity, cannot be managed efficiently with current methodologies or data mining software tools. Big Data mining is the capability of extracting useful information from these large datasets or streams of data, which was not possible before because of their volume, variability and velocity [3][11][12][13]. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. Big Data is generally of two types: structured and unstructured. Data that can be easily categorized and analyzed, such as the data from smartphones and GPS devices, sales figures and account balances, is called structured data. Information such as comments on social media, photos, reviews and tweets is unstructured data. Big Data mining for structured data is comparatively easier than for unstructured data [6][7]. New algorithms and new tools are needed to deal with this collected data.
[Figure 1. The 3V's in Big Data Management: Volume, Variety, Velocity [2][3]]
Doug Laney [2] was the first to describe the 3 V's of Big Data Management:
Variety: There are many different types of data, such as numerical data, images, videos, etc.
Volume: Data is ever increasing, while the amount of data that present tools can analyse is not.
Velocity: Data arrives continuously as streams, and we are interested in obtaining useful information from it in real time.
Generally, data mining for Big Data analyzes data and extracts useful information from it. Data mining comprises several algorithms, which fall into the following four categories:
Association: It is used to search for relationships between variables and is applied, for example, in searching for frequently occurring items.
Clustering: It discovers structures and groups in the data and classifies the data according to those groups.
Regression: It finds a function to model the data.
Classification: It associates an unknown structure with a known structure.
These algorithms can be adapted to the Big Data MapReduce paradigm, as the sketch below illustrates. Data clustering in particular has attracted much attention during recent years; however, the ever-growing volume of data makes it a challenging task [10].
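As an illustration of such a conversion, the following minimal Python sketch (ours, for illustration only; a real implementation would distribute the work across a cluster) expresses one iteration of k-means clustering in the map/reduce style: the map step assigns each point to its nearest centroid, and the reduce step recomputes each centroid as the mean of its group.

    # One iteration of k-means in MapReduce style (illustrative sketch).
    from collections import defaultdict

    def nearest(point, centroids):
        # Map-step helper: index of the centroid closest to a point.
        return min(range(len(centroids)),
                   key=lambda i: sum((p - c) ** 2
                                     for p, c in zip(point, centroids[i])))

    def kmeans_iteration(points, centroids):
        # Map: emit (centroid_index, point) pairs, grouped by key.
        groups = defaultdict(list)
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Reduce: recompute each centroid as the mean of its group.
        return [tuple(sum(dim) / len(group) for dim in zip(*group))
                for group in groups.values()]

    points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
    print(kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)]))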
3. BIG DATA CHARACTERISTICS: HACE
THEOREM
Big Data starts with large-volume, heterogeneous, autonomous sources and seeks to explore complex and evolving relationships among the data. It is therefore a challenging task to discover useful information from such data. The HACE theorem summarizes the key characteristics of Big Data [20]:
3.1. Huge Data with Heterogeneous and Diverse
Dimensionality
One of the important characteristics of Big Data is the heterogeneity and diverse dimensionality of its representation. This heterogeneity exists because different information is represented in different schemas according to the users' convenience, and the schema for an application depends on its nature, which results in diverse representations of the data. For example, an individual in the bio-medical world can be represented using demographic information such as sex, age, past medical history, allergies, etc. For an X-ray examination, images or videos are used to represent the results. For a DNA test, microarray expression images are used to represent the genetic code information. The different types of representation of the same individual are called heterogeneous features, and the variety of features involved in representing each single observation are called diverse features [14].
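To make this concrete, a single record combining such heterogeneous features might look like the following illustrative Python structure (all field names and values are hypothetical):

    # One individual, represented by heterogeneous, diverse features.
    patient = {
        "demographics": {"sex": "F", "age": 54, "allergies": ["penicillin"]},
        "history":      ["hypertension", "appendectomy"],
        "xray":         "scans/patient_0042_chest.png",  # image result
        "microarray":   [0.82, 0.11, 0.67, 0.95],        # expression levels
    }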
3.2. Autonomous Sources with Distributed and
Decentralized Control
Autonomous sources generate and collect information without any centralized control. Moreover, since the data might be corrupted or vulnerable to threats, these large volumes of data are generally distributed across many servers. Consider the World Wide Web (WWW), in which each web server provides a certain amount of information and is able to function fully without relying on other servers [14].
3.3. Complex and Evolving Relationships
Complex data consists of multi-structured, multi-source data such as bills, documents, images, videos, etc. Such data, and the evolving relationships within it, require a more sophisticated approach to dealing with Big Data [14].
4. PROBLEMS THAT ARISE
The increase in data has resulted in enthusiasm about the possibilities of measuring and monitoring Big Data in order to explore it for useful purposes. A major problem is that Big Data is usually not collected for statistical purposes, which increases the risk of yielding misleading figures. Data extraction from Big Data sometimes results in poor analytics: the data obtained from digital media, electronics, social media websites and sensors is largely unstructured and unfiltered, unlike traditional data sources, which are well structured and suited for good analytics but involve a fairly high cost of data collection [17][18].
Big Data mining also carries the risk of drawing misleading conclusions from large data sets. Several problems are associated with it:
 Bias in Selection: Data mining is mostly performed on observational data, which means there is no control over the treatments and conditions of the objects being studied, whereas in experimental studies such control is well exercised.
 Data may be out of date: In business, data mining generally has to be performed on streaming data, and past data is of seldom use in such applications.
 Empirical rather than Iconic Models: Iconic models are mathematical representations, whereas empirical models are based on finding convenient summaries of a data set. Most data mining models are empirical, and they yield comparatively less accurate results than iconic models.
 Performance Measurement: Since different performance criteria are likely to yield different orders of merit, it is important to choose a criterion that closely matches the objectives of the analysis.
5. CHALLENGES IN BIG DATA
Big Data mining refers to the mining of data sets whose size is beyond the ability of software tools such as database management tools, or other traditional tools, to capture, manage and analyze within a stipulated time period. Since data across the Internet is increasing continuously, meeting this challenge is quite difficult.
Big Data provides extraordinary benefits to the world economy, in the field of national security and in areas ranging from marketing and risk analysis to medical research and urban planning. However, concerns about privacy and data protection may outweigh these benefits.
Big Data needs to be continuously verified, because data sources can easily be corrupted by the addition of malicious data. Hence information security is becoming a big data analytics problem in which massive data sets must be explored. Any security control must meet the following requirements:
 Basic functionalities should not be compromised.
 Big Data characteristics should not be compromised.
 Security threats should be addressed properly.
Security can be enhanced in order to provide tight protection to the data by using the following techniques:
 Authentication Methods: A process that verifies the user before access to the system is granted, for example Kerberos.
 Encryption: This ensures confidentiality and privacy of user information and secures vital data. It protects data even if a malicious user gains access to the system (see the sketch after this list).
 Access Control: It assigns access privileges to users.
 Logging: To detect attacks, diagnose failures or investigate unusual behavior, logging keeps a record of activity. It gives a place to look when something fails or the system is hacked.
 Secure Communication: It can be established between nodes, and between nodes and applications, using SSL/TLS to protect all network communication.
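As a small illustration of the encryption technique above, the following Python sketch (assuming the third-party cryptography package is installed) encrypts a record before storage, so the plaintext remains protected even if a malicious user gains access to the stored data:

    # Sketch: symmetric encryption of a record before storage
    # (requires the third-party "cryptography" package).
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()     # keep this key in a secure key store
    cipher = Fernet(key)

    record = b"account=12345;balance=9800.50"
    token = cipher.encrypt(record)  # safe to store or transmit
    print(cipher.decrypt(token))    # recoverable only with the key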
6. BIG DATA FOR DEVELOPMENT: CHALLENGES AND OPPORTUNITIES
Big Data creates new opportunities for development:
Real-time awareness: Programs and policies built on a more refined representation of reality [15].
Early warning: Responding in time of crisis by detecting anomalies in the usage of digital media [15].
Real-time feedback: Monitoring in real time which programs fail and making changes accordingly [15].
Big Data for Development sources share some of the following features:
Digitally Generated: Data is produced digitally, stored sequentially and can be manipulated easily [16].
Passively Produced: Data is generally a by-product of our daily lives or of interaction with digital services [16].
Automatically Collected: Data is generated and collected automatically [16].
Trackable: Location can be tracked from the data [16].
Continuously Analysed: Information is relevant to human well-being and development and can be analysed in real time [16].
7. TOOLS AND TECHNIQUES FOR BIG
DATA MINING
Big Data provides extraordinary benefits to big firms by producing useful information that helps them manage their problems. Big Data analysis involves the automatic discovery of information and the analysis of underlying patterns and hidden rules. These data sets are too large and complex for humans to extract information from manually, without the use of computational tools. Various emerging technologies help to process and transform Big Data into meaningful knowledge.
7.1. Hadoop:
Hadoop is a scalable, open-source, fault-tolerant Virtual Grid operating system architecture for data storage and processing. It runs on commodity hardware and uses a fault-tolerant, high-bandwidth clustered storage architecture called HDFS. For distributed data processing it runs MapReduce, and it works with both structured and unstructured data. Tools like Hive, Pig and Mahout, which are part of the Hadoop and HDFS framework, are used to handle the velocity and heterogeneity of data. Hadoop and HDFS (Hadoop Distributed File System) by Apache are widely used for storing and managing big data [3][21][22].
Hadoop consists of the Hadoop Distributed File System (HDFS), which stores the data, and MapReduce, which processes the data at an efficient rate. Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. Hadoop MapReduce then transfers packaged code to the nodes to process the data in parallel. The advantage of this approach is data locality: nodes manipulate the data they have on hand, which allows data processing to be faster and more efficient.
The Hadoop framework is composed of the following modules:
 Hadoop Common – contains libraries and utilities;
 Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines;
 Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications;
 Hadoop MapReduce – a programming model for large-scale data processing.
The Hadoop package consists of a MapReduce engine, the Hadoop Distributed File System (HDFS) and OS-level abstractions. Hadoop clusters can be small or large. A small Hadoop cluster consists of a single master node (running a Job Tracker, Task Tracker, Name node and Data node) and multiple worker nodes that act as both Data node and Task tracker. In a large cluster, the HDFS is managed through a dedicated NameNode server that hosts the file system index, and a secondary NameNode that generates snapshots of the NameNode's memory structures in order to prevent loss of data [24].
[Figure 2. Hadoop Cluster [24]]
Hadoop Distributed File System: A Hadoop cluster has a single Name node plus a cluster of Data nodes, with redundancy options available for the Name node. HDFS serves blocks of data using a block protocol over the network and uses TCP/IP sockets for communication. It stores large files across multiple machines, and replicating the data across multiple hosts improves reliability and availability.
Job Tracker and Task Tracker: Client applications submit MapReduce jobs to the Job tracker, which passes the work on to the Task tracker nodes.
Scheduling: By default, the Hadoop framework uses a First-In First-Out (FIFO) approach for scheduling.
MapReduce: It is a programming model used to process large data sets with a parallel, distributed algorithm on a cluster. It is a software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes [19][23]. It consists of two methods: map(), which performs filtering and sorting, and reduce(), which summarizes the results.
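To make the two methods concrete, here is a minimal, self-contained Python sketch of word counting, the canonical MapReduce example (this illustration is ours; a real Hadoop job would shuffle the intermediate pairs across the cluster between the two phases):

    # Word count expressed as map() and reduce() phases (illustrative).
    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # map(): emit a (word, 1) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def reduce_phase(pairs):
        # reduce(): after sorting by key, sum the counts per word.
        for word, group in groupby(sorted(pairs), key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    lines = ["big data mining", "mining big data is big"]
    print(list(reduce_phase(map_phase(lines))))
    # [('big', 3), ('data', 2), ('is', 1), ('mining', 2)]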
7.2. S4:
S4 provides a platform for processing continuous data streams and is generally used to manage them. S4 apps are designed by combining streams and processing elements in real time [2][3].
7.3. Storm:
This software, developed by Nathan Marz at Twitter, is used for streaming data-intensive distributed applications, similar to S4 [1][3].
8. FORECAST TO THE FUTURE
Big Data mining will face important challenges in the future due to the very nature of big data. Among the challenges researchers will have to deal with are:
Analytics Architecture: Researchers need to find an optimal architecture to deal with historic data and ever-increasing streaming data at the same time. One example of such an architecture is the Lambda architecture of Nathan Marz.
Time evolving data: It is important that Big Data mining deals with data that evolves over time; in some cases the methods must first detect the change and then adapt to it.
Distributed mining: Many data mining methods are not trivial to parallelize. Research is needed to develop distributed versions of these methods [22].
Compression: Ever-increasing data means that the amount of space needed to store it must also be considered. Hence data needs to be compressed, which may take more time but requires less space (see the sketch after this list).
Statistical Significance: Statistically significant results are required in order to understand the data properly.
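As a toy illustration of the time-for-space trade-off mentioned in the Compression item, the following Python snippet uses the standard gzip module (the exact figures printed will vary by machine):

    # Sketch: trading CPU time for storage space with compression.
    import gzip
    import time

    data = b"volume velocity variety " * 100_000  # ~2.4 MB of raw data

    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=9)
    elapsed = time.perf_counter() - start

    print(f"raw: {len(data)} bytes, compressed: {len(compressed)} bytes")
    print(f"compression took {elapsed:.3f} s")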
This section dealt with the challenges of forecasting the future. In the next section, we conclude the paper.
9. CONCLUSION
Due to the expansion of the Internet, data is increasing at a tremendous rate; hence Big Data is going to continue growing over the next years and will become one of the most exciting opportunities of the near future. This paper has provided an insight into the topic, its characteristics and its challenges for the future. Big Data is becoming important for scientific data research and business applications. Security management and privacy protection are among the many challenges faced by Big Data, and are becoming a big issue because of the ever-increasing growth of data in terms of volume, velocity and variety.
From the security point of view, the threats associated with this data are increasing at an unprecedented rate. Future research is therefore needed to find an optimal architecture to manage this overwhelming data and to address the security and privacy challenges in a holistic manner. We are now in a new era where Big Data mining will help us discover knowledge that no one has discovered before.
REFERENCES
[1] http://big-data-mining.org/
[2] Neha A. Kandalkar, Avinash Wadhe, "Extracting Large Data using Big Data Mining", International Journal of Engineering Trends and Technology (IJETT), Volume 9, Number 11, March 2014.
[3] Wei Fan, Albert Bifet, "Mining Big Data: Current Status, and Forecast to the Future".
[4] J. Gama, Knowledge Discovery from Data Streams, Chapman & Hall/CRC, 2010.
[5] James Joshi, Balaji Palanisamy, "Towards Risk-aware Policy based Framework for Big Data Security and Privacy", 2014.
[6] Bharti Thakur, Manish Mann, "Data Mining for Big Data: A Review", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014, ISSN: 2277 128X.
[7] http://www.bls.gov/careeroutlook/2013/fall/art01.pdf
[8] UN Global Pulse, http://www.unglobalpulse.org
[9] Puneet Singh Duggal, Sanchita Paul, "Big Data Analysis: Challenges and Solutions", Int. Conf. on Cloud, Big Data and Trust, RGPV, 2013.
[10] A.N. Nandhakumar, Nandita Yambem, "A Survey of Data Mining Algorithms on Apache Hadoop Platforms", IJETAC, Volume 4, Issue 1, January 2014.
[11] D. Pratiba, G. Shobha, "Educational BigData Mining Approach in Cloud: Reviewing the Trend", International Journal of Computer Applications (0975-8887), Volume 92, No. 13, April 2014.
[12] Deepali Kishor Jadhav, "Big Data: The New Challenges in Data Mining", IJIRCST, ISSN: 2347-5552, Volume 1, Issue 2, September 2013.
[13] Harshawardhan S. Bhosale, Devendra P. Gadekar, "A Review Paper on Big Data and Hadoop", International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014, ISSN 2250-3153.
[14] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding, "Data Mining with Big Data", IEEE, Volume 26, Issue 1, January 2014.
[15] http://albertbifet.com/global-pulse-big-data-for-development/
[16] http://www.ciard.net/sites/default/files/bb40_backgroundnote.pdf
[17] http://paris21.org/newsletter/fall2013/big-data-dr-jose-ramonalbert
[18] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding, "Data Mining with Big Data".
[19] Richa Gupta, Sunny Gupta, Anuradha Singhal, "Big Data: Overview", IJCTT, 9(5), 2014.
[20] http://www.ijarcsse.com/docs/papers/Volume_4/5_May2014/V4I5-0328.pdf
[21] http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=7021746
[22] http://www.academia.edu/10083723/ISSUES_CHALLENGES_AND_SOLUTIONS_BIG_DATA_MINING
[23] http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
[24] https://en.wikipedia.org/wiki/Apache_Hadoop#Architecture