International Journal of Engineering Trends and Technology (IJETT) – Volume 29 Number 2 - November 2015

A Novel Clustering Model with Improved MapReduce Technique over Big Data

K.U.V.Padma1, Y.Laxmana Rao2, P.Haribabu3
Assistant Professor1,2,3, Raghu Institute of Technology, Modavalasa, Andhra Pradesh

Abstract: Data clustering becomes a complex task as the amount of data grows. The most common approach, the vector space model, represents documents as a bag of words and does not capture semantic relations between words. The purpose of this work is to integrate WordNet with bisecting k-means in order to exploit the semantic relations between the data points. Our proposed model runs over big data on Hadoop, and the experimental results show better performance than the traditional approaches.

I. INTRODUCTION

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to one another than to those in other groups. It is a principal task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics [1]. Ontology has been adopted to enhance document clustering, and has further been used to address the problem of the very large number of document features [4]. Motivated by the importance of ontology and its ability to improve document clustering, in this research we investigate the use of the WordNet ontology [5] in order to reduce the huge volume of document features to only 26 features representing the WordNet lexical noun categories. In addition, WordNet helps in representing semantic relations between terms.
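The feature-reduction idea described above — collapsing raw document terms into a small, fixed set of WordNet lexical noun categories — can be sketched as follows. This is a minimal illustration only: the LEXNAMES table is a hypothetical stand-in for the real WordNet database, which maps every noun sense to one of 26 lexicographer categories such as noun.person or noun.artifact.

```python
from collections import Counter

# Hypothetical stand-in for the WordNet lexicographer-category lookup.
# A real implementation would query the WordNet database for each noun.
LEXNAMES = {
    "driver": "noun.person",
    "car": "noun.artifact",
    "road": "noun.artifact",
    "speed": "noun.attribute",
}

def category_features(tokens):
    """Replace raw term features with counts over lexical categories."""
    counts = Counter(LEXNAMES[t] for t in tokens if t in LEXNAMES)
    return dict(counts)

print(category_features(["driver", "car", "road", "speed", "the"]))
```

With the real WordNet lookup substituted in, every document shrinks to at most 26 category features regardless of its vocabulary size.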
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data pre-processing and model parameters until the result achieves the desired properties [2][3].

Although using WordNet to reduce document dimensionality enhances document clustering efficiency, the traditional implementation of this approach is not very efficient due to the huge volumes of data. Thus, using a parallel programming paradigm to implement this solution is a more appealing approach, and we investigated running the document clustering process in a distributed framework by adopting the common MapReduce programming model [6]. MapReduce is a programming model introduced by Google's team for processing huge datasets on distributed systems. It presents a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations.
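As a concrete illustration of one such algorithm, the bisecting k-means clustering that this paper builds on repeatedly splits the largest cluster with a basic 2-means pass. The sketch below is our own illustrative plain-Python version, with Euclidean distance and deterministic seeding; a document-clustering implementation would typically use cosine similarity over term vectors instead.

```python
def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def two_means(points, iters=10):
    """Split one cluster into two with a basic 2-means pass."""
    centers = [points[0], points[-1]]          # deterministic seeding
    halves = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, centers[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, centers[1]))
            halves[0 if d0 <= d1 else 1].append(p)
        if not halves[0] or not halves[1]:
            break                               # degenerate split
        centers = [mean(halves[0]), mean(halves[1])]
    return halves

def bisecting_kmeans(points, k):
    """Grow from one cluster to k by always bisecting the largest."""
    clusters = [list(points)]
    while len(clusters) < k:
        big = max(clusters, key=len)
        halves = two_means(big)
        if not halves[0] or not halves[1]:
            break                               # cannot split further
        clusters.remove(big)
        clusters.extend([halves[0], halves[1]])
    return clusters
```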
In addition, it enables programmers who have no experience with parallel and distributed systems to utilize the resources of a large distributed system in an easy and efficient way [6]. The two main problems facing the document clustering process are the huge volume of data and the large number of document features. In this paper we propose a novel approach to enhance the efficiency of document clustering. The proposed system is divided into two phases: the objective of the first phase is to decrease the very large number of document terms (also known as document features) by adopting the WordNet ontology; the second phase copes with the huge volume of data by employing the MapReduce parallel programming paradigm to enhance system performance. In the following sections we discuss each phase in more detail.

II. RELATED WORK

Big Data: Big data refers to very large volumes of available data, both structured and unstructured. It arrives from multiple sources at high velocity, volume and variety, and is also derived from the exchange of data between systems over any medium. Big data changes how the people of an organization work together: it creates a setting in which business and IT leaders must join forces to extract value from all their data. New skills are needed to handle big data; courses are offered to prepare a new generation of practitioners, and experts are developing new roles, focusing on the key challenges, and creating new models to get the most from big data. An estimated 4.4 million data scientists are needed by 2015 [7].

Ontology: Ontology is the study of what exists. Historically, it is derived from the branch of philosophy known as metaphysics; in information technology, an ontology is a working model of the entities and interactions in a domain of knowledge. An ontology mainly describes the following [8]: 1. Individuals, 2. Classes, 3. Attributes, 4. Relationships.

Individuals: Individuals are also called instances.
Individuals are the basic components of an ontology. They may include concrete objects such as people, animals, automobiles, molecules and planets, as well as numbers and words.

Classes: Classes are also called concepts. A class is a collection of objects and may include individuals, other classes, or both. Some examples of classes are: Person, the class of all people; Molecule, the class of all molecules; Number, the class of all numbers.

Attributes: Objects can be described by assigning values to their attributes. Each attribute has a specific name and a value, which are used to store information. For example, a car may have the following attributes [4][9]: Name: Ford Explorer; Number of doors: 4; Transmission: 6-speed.

Relationship: An important use of attributes is to describe the relationships between objects in an ontology. A relation is an object whose attribute is another object in the ontology.

Data scientists and their responsibilities: Data scientists are responsible for designing and implementing processes and layouts for the large-scale data sets used in modeling, data mining and research, and they typically work on several projects at a time. According to business needs, they develop and plan the required analytical projects; work with application developers to extract the data relevant to an analysis; create new data definitions for tables or files and develop existing data for further use; and manage and guide the junior members of the team.

III. PROPOSED WORK
MapReduce is a programming model, with an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm. In the MapReduce model, the data-processing primitives are called mappers and reducers. Once an application is written in MapReduce form, it can be run any number of times on a cluster without modification. The major advantage of MapReduce is that it makes data processing over multiple computing nodes easy. A MapReduce program executes in the following stages:

Map stage: the mapper's job is to take a set of data and convert it into another set of data in which individual elements are broken into tuples (key/value pairs). The input data takes the form of files and directories.

Reduce stage: this stage combines the shuffle step and the reduce step. The output of the map stage is taken as input to the reduce stage, where the data tuples are combined into a smaller set of tuples. The results are stored in the Hadoop Distributed File System (HDFS).

Entropy: entropy is a measure of uncertainty used for evaluating clustering results. For each cluster j, the entropy is calculated as

E_j = -\sum_{i=1}^{c} p_{ij} \log p_{ij}

where c is the number of classes and p_{ij} = n_{ij}/n_j is the probability that a member of cluster j belongs to class i, with n_{ij} the number of objects of class i belonging to cluster j and n_j the total number of objects in cluster j. The total entropy over all clusters is calculated as

E = \sum_{j=1}^{k} \frac{n_j}{n} E_j

where k is the number of clusters, n_j is the total number of objects in cluster j, and n is the total number of all objects.
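The map and reduce stages described in this section can be mimicked in-process to show the data flow. This is an illustrative sketch only — a real Hadoop job distributes these functions across nodes and persists results to HDFS — and all names here are our own.

```python
from collections import defaultdict

def mapper(document):
    """Map stage: break input into (key, value) tuples — here, word counts."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group all values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce step: combine each group into a smaller tuple."""
    return (key, sum(values))

def run_job(documents):
    """Run the whole pipeline in-process over a list of documents."""
    pairs = [kv for doc in documents for kv in mapper(doc)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

print(run_job(["big data", "big clusters"]))
```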
Semantic Web: The Semantic Web is an extension of the Web promoted by the World Wide Web Consortium (W3C). If the Semantic Web is viewed as a global database, it is easy to understand why one needs a query language for it. The W3C uses Semantic Web technology to improve collaboration, research and development, and innovation; it provides a common framework that allows data to be shared and reused across applications and enterprises. The Semantic Web acts as an integrator across different content and information systems, and has applications in publishing, blogging and many other areas [2].

For assessing the proposed document clustering approach, we perform two kinds of cluster evaluation: external evaluation and internal evaluation. External evaluation is applied when the documents are labeled (i.e., their true clusters are known a priori); internal evaluation is applied when the document labels are unknown.

Purity: purity measures the degree to which each cluster contains a single class label. To compute purity, for each cluster j we count the number of occurrences of each class i and select the maximum occurrence; purity is then the sum of these maxima divided by the total number of objects n:

Purity = \frac{1}{n} \sum_{j=1}^{k} \max_i n_{ij}

F-measure: the F-measure evaluates the quality of hierarchical clustering [4]. It is a mix of recall and precision: each cluster is treated as the result of a query, and each class as the desired set of data. First, precision and recall are computed for each class i in each cluster j:

P(i,j) = \frac{n_{ij}}{n_j}, \qquad R(i,j) = \frac{n_{ij}}{n_i}

where n_{ij} is the number of objects of class i in cluster j, n_i is the total number of objects in class i, and n_j is the total number of objects in cluster j. The F-measure of class i and cluster j is then computed as

F(i,j) = \frac{2 \, P(i,j) \, R(i,j)}{P(i,j) + R(i,j)}

The maximum F-measure over clusters is selected for each class, and the total F-measure is calculated as

F = \sum_{i=1}^{c} \frac{n_i}{n} \max_j F(i,j)

where n is the total number of data objects and c is the total number of classes.
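The evaluation measures used in this paper might be computed as follows. The functions are written directly from the formulas in the text — entropy, purity and F-measure over a contingency table in which n[j][i] is the number of objects of class i in cluster j — plus the internal cosine measure; the names and data layout are our own illustrative choices.

```python
import math

def entropy(n):
    """Total entropy E = sum_j (n_j / n) * E_j over a contingency table."""
    total = sum(sum(row.values()) for row in n)
    E = 0.0
    for row in n:
        nj = sum(row.values())
        Ej = -sum((c / nj) * math.log(c / nj) for c in row.values() if c > 0)
        E += (nj / total) * Ej
    return E

def purity(n):
    """Sum of per-cluster maximum class counts, divided by n."""
    total = sum(sum(row.values()) for row in n)
    return sum(max(row.values()) for row in n) / total

def f_measure(n):
    """Total F-measure: per class, best F over clusters, weighted by n_i/n."""
    total = sum(sum(row.values()) for row in n)
    classes = {i for row in n for i in row}
    F = 0.0
    for i in classes:
        ni = sum(row.get(i, 0) for row in n)
        best = 0.0
        for row in n:
            nij, nj = row.get(i, 0), sum(row.values())
            if nij:
                p, r = nij / nj, nij / ni
                best = max(best, 2 * p * r / (p + r))
        F += (ni / total) * best
    return F

def internal_measure(vectors, centers, assign):
    """Mean cosine similarity between each vector and its cluster centre."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den
    return sum(cos(v, centers[assign[i]]) for i, v in enumerate(vectors)) / len(vectors)
```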
Internal evaluation: for internal evaluation, the goal is to maximize the cosine similarity between each document and its associated cluster centre, dividing the accumulated result by the total number of data objects:

S = \frac{1}{n} \sum_{j=1}^{k} \sum_{x \in C_j} \cos(x, c_j)

where k denotes the number of clusters, C_j is the set of data assigned to cluster j, c_j is the centre of cluster j, and x is a data vector. The value of this measure ranges from 0 to 1, and higher values indicate better clustering results.

IV. CONCLUSION

In this paper we adapted data clustering to big data. We conclude that WordNet lexical categories can represent data tokens, capture the semantic relations between words, and reduce the dimensionality of the feature space; in this work we used the lexical categories of nouns only. A possible enhancement is to cluster the data points with bisecting k-means using lexical patterns of the data.

REFERENCES
[1] Zamir, O., & Etzioni, O. (1999). Grouper: a dynamic clustering interface to Web search results. Computer Networks, 31(11), 1361-1374.
[2] Andrews, N. O., & Fox, E. A. (2007). Recent developments in document clustering. Tech. rep. TR-07-35, Department of Computer Science, Virginia Tech.
[3] Cheng, Y. (2008). Ontology-based fuzzy semantic clustering. In Convergence and Hybrid Information Technology, 2008 (ICCIT '08), Third International Conference on (Vol. 2, pp. 128-133). IEEE.
[4] Recupero, D. R. (2007). A new unsupervised method for document clustering by using WordNet lexical and conceptual relations. Information Retrieval, 10(6), 563-579.
[5] Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
[6] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[7] Gruber, T. R. (1991).
The role of common ontology in achieving sharable, reusable knowledge bases. KR, 91, 601-602.
[8] Cantais, J., Dominguez, D., Gigante, V., Laera, L., & Tamma, V. (2005). An example of food ontology for diabetes control. In Proceedings of the International Semantic Web Conference 2005 workshop on Ontology Patterns for the Semantic Web.