International Journal of Engineering Trends and Technology (IJETT) – Volume 29 Number 2 - November 2015
A Novel Clustering Model with Improved Map Reducing Technique over Big Data
K.U.V.Padma, Y.Laxmana Rao, P.Haribabu
Assistant Professors, Raghu Institute of Technology, Modavalasa, Andhra Pradesh.
Abstract:
Data clustering becomes an increasingly complex task as the amount of data grows. The most common approach, the vector space model, represents documents as a bag of words and captures no semantic relations between them. The purpose of this work is to integrate WordNet lexical categories with bisecting k-means so that the semantic relations between data points are utilized. The proposed model runs on big data using Hadoop, and the experimental results show better performance than the traditional approaches.
I. INTRODUCTION
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to one another than to those in other groups. It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics [1].
Ontologies have been adopted to improve document clustering, and in particular to address the problem of the very large number of document features [4]. Motivated by the importance of ontologies and their ability to improve document clustering, in this work we investigate the use of the WordNet ontology [5] in order to reduce the enormous number of document features to only 26 features representing the WordNet lexical noun categories. In addition, WordNet helps in representing semantic relations between terms.
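The feature-reduction idea can be sketched in a few lines of Python. The category map below is a hand-made stand-in for WordNet's noun lexical categories (an assumption for illustration; a real implementation would look each token up in WordNet itself):

```python
from collections import Counter

# Toy stand-in for WordNet's 26 noun lexical categories (lexnames).
# A real implementation would query WordNet for each token's category.
CATEGORY_MAP = {
    "car": "noun.artifact", "engine": "noun.artifact",
    "driver": "noun.person", "mechanic": "noun.person",
    "speed": "noun.attribute", "price": "noun.attribute",
}

def document_features(tokens):
    """Map raw tokens to lexical-category counts, shrinking the
    feature space from |vocabulary| to the number of categories."""
    cats = [CATEGORY_MAP[t] for t in tokens if t in CATEGORY_MAP]
    return Counter(cats)

doc = ["car", "engine", "driver", "speed", "the", "price"]
print(document_features(doc))
```

However large the vocabulary, every document is reduced to a short vector of category counts, which is what makes the subsequent clustering cheap.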
Cluster analysis itself is not one specific algorithm but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data pre-processing and model parameters until the result achieves the desired properties [2][3].
ISSN: 2231-5381
Although using WordNet to reduce document dimensionality enhances document clustering efficiency, the traditional implementation of this approach is not very efficient due to the huge volumes of data. Thus, a parallel programming paradigm turns out to be a more appealing way to implement this solution, and we investigated running the document clustering process in a distributed framework by adopting the common MapReduce programming model [6]. MapReduce is a programming model initiated by Google's team for processing huge datasets on distributed systems. It presents a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations. In addition, it enables programmers who have no experience with parallel and distributed systems to utilize the resources of a large distributed system in an easy and efficient way [6].
The two main problems facing the document clustering process are the huge volume of data and the large size of the document feature set. In this paper we propose a novel approach to enhance the efficiency of document clustering. The proposed system is divided into two phases. The objective of the first phase is to decrease the very large number of document terms (also known as document features) by adopting the WordNet ontology. The second phase copes with the huge volume of data by employing the MapReduce parallel programming paradigm to enhance system performance. In the following sections we discuss each phase in more detail.
http://www.ijettjournal.org
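Since the paper gives no pseudocode for the clustering step, the following is a minimal generic sketch of bisecting k-means, the algorithm the second phase builds on (standard bisecting k-means, not the authors' exact implementation): start with all points in one cluster and repeatedly split the largest cluster with a plain 2-means step.

```python
import random

def kmeans2(points, iters=10):
    """Plain 2-means split used as the inner step of bisecting k-means."""
    centers = random.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        # Recompute each centre as the mean of its group (keep old centre if empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [g for g in groups if g]

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        biggest = max(clusters, key=len)
        clusters.remove(biggest)
        clusters.extend(kmeans2(biggest))
    return clusters
```

With the WordNet phase in place, the input points would be the 26-dimensional category-count vectors rather than full term vectors.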
II. RELATED WORK
Big Data:
Big data refers to the large amounts of data, both structured and unstructured, that are now available. Big data arrives from multiple sources at high velocity, volume and variety, and is also produced by the exchange of data between systems over any medium. Big data changes how the people in an organization work together: business and IT leaders must join forces to extract value from all data. New skills are needed to handle big data, and courses are being offered to prepare a new generation of practitioners, while experts are defining new roles, focusing on the key challenges and creating new models to get the most from big data. An estimated 4.4 million data scientists were needed by 2015 [7].
Data Scientists and their responsibilities:
Data scientists are responsible for designing and implementing processes and layouts for the large-scale data sets used in modeling, data mining and research. Data scientists typically work on several projects at a time.
Responsibilities:
1. According to business needs, they develop and plan the required analytical projects.
2. They work with application developers to extract the data relevant to an analysis.
3. They create new data definitions for tables or files and develop existing data for further analysis and usage.
4. They manage and guide the junior members of the team.
Ontology:
Ontology is the study of what things exist. In information technology, an ontology is a working model of the entities and interactions in a domain of knowledge. Historically, ontology derives from the branch of philosophy known as metaphysics. An ontology mainly describes the following [8]:
1. Individuals
2. Classes
3. Attributes
4. Relationships
Individuals:
Individuals, also called instances, are the basic components of an ontology. They may include concrete objects such as people, animals, automobiles, molecules and planets, as well as numbers and words.
Classes:
Classes, also called concepts, are collections of objects. A class may contain individuals, other classes, or both. Some examples of classes are the following:
Person: the class of all people
Molecule: the class of all molecules
Number: the class of all numbers
Attributes:
Objects are described by assigning values to their attributes. Each attribute has a specific name and a value, which together store one piece of information. For example, a car may have the following attributes [4][9]:
Name: Ford Explorer
Number of doors: 4
Transmission: 6-speed
Relationship:
An important use of attributes is to describe the relationships between objects in an ontology. A relation is an object whose attribute is another object in the ontology.
III. PROPOSED WORK
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm. In the MapReduce model, the data-processing primitives are called mappers and reducers. Once an application is written in MapReduce form, it can be run any number of times, on any number of machines, without modification. The major advantage of MapReduce is that it makes data processing over multiple computing nodes easy. A MapReduce program executes in the following stages:
Map Stage:
The mapper's job is to take a set of data and convert it into another set of data in which individual elements are broken into tuples (key/value pairs). The input data is given as files and directories.
Reduce Stage:
This stage is the combination of the Shuffle stage and the Reduce stage. The output of the map stage is taken as input to the reduce stage, where the data tuples are combined into a smaller set of tuples. The results are stored in the Hadoop Distributed File System (HDFS).
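The two stages can be illustrated with a single k-means-style clustering iteration written as a plain-Python mapper and reducer (an illustration of the programming model only, not Hadoop API code):

```python
def mapper(point, centers):
    """Map stage: emit (nearest-centre-index, point) as a key/value pair."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return (dists.index(min(dists)), point)

def reducer(key, points):
    """Reduce stage: recompute the centroid of all points shuffled to this key."""
    return (key, tuple(sum(col) / len(points) for col in zip(*points)))

def mapreduce_iteration(data, centers):
    # Shuffle stage: group the mapper output by key.
    groups = {}
    for point in data:
        key, value = mapper(point, centers)
        groups.setdefault(key, []).append(value)
    return [reducer(k, pts)[1] for k, pts in sorted(groups.items())]

data = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
print(mapreduce_iteration(data, [(0.0, 0.0), (10.0, 10.0)]))
# → [(0.5, 0.0), (9.5, 9.0)]
```

On Hadoop, each mapper call runs on the node holding its block of data and only the small (key, centroid-sum) traffic crosses the network, which is what makes the approach scale to big data.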
Semantic Web:
The Semantic Web is an extension of the Web driven by the World Wide Web Consortium (W3C). If the Semantic Web is viewed as a global database, it is easy to understand why one needs a query language for it. The W3C uses Semantic Web technology to improve collaboration, research and development, and innovation. It provides a common framework which allows data to be shared and reused across applications and enterprises. The Semantic Web acts as an integrator across different content and information systems, and has applications in publishing, blogging and many other areas [2].
For assessing the proposed document clustering methodology, we perform two kinds of cluster evaluation: external evaluation and internal evaluation. External evaluation is applied when the documents are labeled (i.e. their true clusters are known a priori). Internal evaluation is applied when the document labels are unknown.
Entropy
Entropy is a measure of uncertainty used for evaluating clustering results. For each cluster j, the entropy is calculated as follows:
E_j = - Σ_{i=1..c} p_ij log(p_ij),  with p_ij = n_ij / n_j
where c is the number of classes, p_ij is the probability that a member of cluster j belongs to class i, n_ij is the number of objects of class i belonging to cluster j, and n_j is the total number of objects in cluster j. The total entropy over all clusters is calculated as follows:
E = Σ_{j=1..k} (n_j / n) E_j
where k is the number of clusters, n_j is the total number of objects in cluster j, and n is the total number of all objects.
Purity
Purity is a measure of the degree to which each cluster contains a single class label. To compute purity, for every cluster j we count the occurrences n_ij of each class i and select the maximum occurrence; the purity is then the sum of all maximum occurrences divided by the total number of objects n:
Purity = (1/n) Σ_{j=1..k} max_i n_ij
F-measure
The F-measure is a measure for evaluating the quality of a hierarchical clustering [4]. It is a mix of recall and precision: each cluster is considered as the result of a query, and each class is considered as the desired set of data. First, the precision and recall are computed for each class i in each cluster j:
P(i, j) = n_ij / n_j,  R(i, j) = n_ij / n_i
where n_ij is the number of objects of class i in cluster j, n_i is the total number of objects in class i, and n_j is the total number of objects in cluster j. The F-measure of class i and cluster j is then computed as follows:
F(i, j) = 2 · P(i, j) · R(i, j) / (P(i, j) + R(i, j))
The maximum F-measure over clusters is selected for each class, and the total F-measure is calculated as follows, where n is the total number of objects and c is the total number of classes:
F = Σ_{i=1..c} (n_i / n) max_j F(i, j)
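The external measures above can be computed directly from the class/cluster counts; a compact sketch (with function and variable names of our own choosing) is:

```python
from math import log

def metrics(labels_true, labels_pred):
    """External evaluation: entropy, purity and F-measure,
    where n_ij = number of objects of class i in cluster j."""
    clusters = sorted(set(labels_pred))
    classes = sorted(set(labels_true))
    n = len(labels_true)
    counts = {(i, j): sum(1 for t, p in zip(labels_true, labels_pred)
                          if t == i and p == j)
              for i in classes for j in clusters}
    n_j = {j: sum(counts[i, j] for i in classes) for j in clusters}
    n_i = {i: sum(counts[i, j] for j in clusters) for i in classes}
    # E = sum_j (n_j/n) * (-sum_i p_ij log p_ij), with p_ij = n_ij / n_j
    entropy = sum(
        (n_j[j] / n) * -sum((counts[i, j] / n_j[j]) * log(counts[i, j] / n_j[j])
                            for i in classes if counts[i, j] > 0)
        for j in clusters)
    # Purity = (1/n) * sum_j max_i n_ij
    purity = sum(max(counts[i, j] for i in classes) for j in clusters) / n
    # F = sum_i (n_i/n) * max_j F(i, j)
    def f(i, j):
        if counts[i, j] == 0:
            return 0.0
        p, r = counts[i, j] / n_j[j], counts[i, j] / n_i[i]
        return 2 * p * r / (p + r)
    fmeasure = sum((n_i[i] / n) * max(f(i, j) for j in clusters) for i in classes)
    return entropy, purity, fmeasure
```

A perfect clustering yields entropy 0 and purity and F-measure 1; lower entropy and higher purity/F indicate better agreement with the true labels.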
Internal evaluation
For internal evaluation, the goal is to maximize the cosine similarity between each document and its associated cluster centre, averaged over all documents:
S = (1/n) Σ_{j=1..k} Σ_{x ∈ C_j} cos(x, c_j)
where k denotes the number of clusters, C_j is the set of data vectors assigned to cluster j, c_j is the centre of cluster j, and n is the total number of data vectors. The value of this measure ranges from 0 to 1, and higher values indicate better clustering results.
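A small sketch of this internal measure, assuming clusters are given as lists of equal-length, non-zero term vectors:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors (assumed non-zero)."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def internal_score(clusters):
    """Average cosine similarity between each vector and its cluster centre."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        centre = tuple(sum(col) / len(c) for col in zip(*c))
        total += sum(cosine(x, centre) for x in c)
    return total / n
```

Because term-frequency vectors are non-negative, every cosine term lies in [0, 1], so the score itself stays in that range.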
IV. CONCLUSION
In this paper we modified data clustering for large amounts of data. We conclude that lexical categories can represent data tokens, capture the semantic relations between words, and reduce the dimensionality of the feature space; in this work we used the lexical categories of nouns only. The data points are then clustered with bisecting k-means over these lexical patterns, and further enhancements of this scheme are possible.
REFERENCES
[1] Zamir, O., & Etzioni, O. (1999). Grouper: a dynamic clustering interface to Web search results. Computer Networks, 31(11), 1361-1374.
[2] Andrews, N. O., & Fox, E. A. (2007). Recent developments in document clustering. Tech. rept. TR-07-35. Department of Computer Science, Virginia Tech.
[3] Cheng, Y. (2008). Ontology-based fuzzy semantic clustering. In Convergence and Hybrid Information Technology, 2008. ICCIT '08. Third International Conference on (Vol. 2, pp. 128-133). IEEE.
[4] Recupero, D. R. (2007). A new unsupervised method for document clustering by using WordNet lexical and conceptual relations. Information Retrieval, 10(6), 563-579.
[5] Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
[6] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[7] Gruber, T. R. (1991). The role of common ontology in achieving sharable, reusable knowledge bases. KR, 91, 601-602.
[8] Cantais, J., Dominguez, D., Gigante, V., Laera, L., & Tamma, V. (2005). An example of food ontology for diabetes control. In Proceedings of the International Semantic Web Conference 2005 workshop on Ontology Patterns for the Semantic Web.