A Map-Reduce Based Parallel Algorithm for Large Scale Hierarchical Classification

Akshat Verma
B.E. (Hons) Computer Science, Birla Institute of Technology and Science (BITS), Pilani, Goa Campus
Associate Consultant, EMC Corporation, Bangalore, India

Sabyasachi Upadhyay
B.E. (Hons) Computer Science, Birla Institute of Technology and Science (BITS), Pilani, Pilani Campus
Associate Consultant, EMC Corporation, Bangalore, India

Abstract: Hierarchical classification is a method of grouping in which input data is classified into categories arranged from general to specific: the data is initially arranged in broad groups that are then successively subdivided into narrower groups, and the final output is the overall hierarchical classification of the data. Hierarchical classifiers have applications in many varied areas, from computer vision and written text recognition to document organization. Of these applications, hierarchical classification has become especially popular in recent times for the organization of text documents, particularly on the Web; web directories and Wikipedia are two examples of such hierarchies in widespread use. As the size of the hierarchy grows and the number of documents to be classified increases, this becomes a large scale data processing problem. This paper implements hierarchical classification using the key-value features of map-reduce, together with classifiers suited to the problem at hand.

Keywords: Hierarchical, Map-Reduce, Classification, Parallelization, Top Down

I. INTRODUCTION

In this paper, we present an algorithm for large scale hierarchical classification based on the Map-Reduce parallel programming model. To keep the presentation compact and understandable, we use text classification as the running example to describe the flow of our algorithm, though it can be applied to any problem involving hierarchical classification.

The hierarchical text classification problem consists of a population Sk of k documents to which hierarchical classification labels are to be assigned. Each document di is represented as a feature vector Fi, where 1 <= i <= k. Each feature vector Fi is made up of a collection of word nodes along with their term frequency in the document. N-fold cross-validation is employed: the entire corpus Sk is divided into N sets, one of which is taken as the hold-out data containing the unlabeled documents, while the remaining N-1 sets are assigned hierarchical class labels and constitute the training set. The hierarchical class labels are assigned with the parent node first, followed by its child nodes, and so on, if any.

Representation:

    Doc ID, C0 (or P), C1, C2, …, Cn, a:b c:d e:f g:h

    C0 (or P)          parent label
    C1, C2, …, Cn      child labels
    a:b c:d e:f g:h    word nodes, colon-separated from their term frequencies

For illustration, a training record might look like "d7, Science, Physics, atom:3 spin:1" (a hypothetical document labelled Science > Physics containing "atom" three times and "spin" once).

The test set contains a set of unclassified documents whose classes are to be identified in the hierarchical format.

II. METHODOLOGY

A. Map-Reduce Model

Map-Reduce [1] is a programming model for large scale parallel data processing that uses a distributed algorithm running on a cluster of machines. The Map-Reduce programming model has two components, Map and Reduce. Map is a function applied to a set of input values which produces a set of key-value pairs. Reduce is a function which takes these pairs and applies another function to the results of the map function. A reducer receives all the data for an individual key from all the mappers (shuffling), and due to the shuffling property, the keys are themselves in sorted form when they are collected by the reducers (sorting). So, Map transforms a set of data into key-value pairs, and Reduce aggregates this data and outputs the result of a function applied to it (grouping). The concept is illustrated in Figure 1; a minimal code sketch follows.

Figure 1. Map-Reduce Workflow
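To make the key-value flow concrete, the following is a minimal single-process sketch in Python of the classic word-count example. The function names and the in-memory shuffle are our own illustration of the model, not part of any particular Map-Reduce framework.

    from collections import defaultdict

    def mapper(document):
        # Map: emit a (key, value) pair for every word in the document.
        for word in document.split():
            yield (word, 1)

    def shuffle(mapped_pairs):
        # Shuffle/sort: group every value under its key, as the framework
        # does before handing each key to exactly one reducer.
        groups = defaultdict(list)
        for key, value in mapped_pairs:
            groups[key].append(value)
        return sorted(groups.items())

    def reducer(key, values):
        # Reduce: apply an aggregate function to all values of one key.
        return (key, sum(values))

    documents = ["the quick brown fox", "the lazy dog"]
    mapped = [pair for doc in documents for pair in mapper(doc)]
    print([reducer(key, values) for key, values in shuffle(mapped)])
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]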
B. Proposed Implementation

The proposed methodology applies the map-reduce paradigm to the hierarchical classification problem by breaking it into a multi-stage, multi-label classification structure. Initially, our training documents are labelled as C0 (or P), C1, C2, C3, …, whereas the test documents are unlabeled.

The first step involves training the classifier C on the parent label P, taking into account the complete training set Strain ⊂ Sk, and classifying the test set Stest, such that Strain + Stest = Sk, which assigns a parent label to each component of the test corpus Stest. After the parent labels are assigned, both Stest and Strain are taken as input to the map-reduce program, where the parent labels Pi (or their respective C0's) of both the training and the test set are set as keys in the mapper phase. As a result, all documents with the parent label Pi go to the same reducer Ri. Thus, we obtain multiple subsets of the population, SM1, SM2, …, where the ith subset SMi consists of training as well as test documents. The training component of each individual subset SMi is trained on the label C1, and first-level child labels are assigned to the corresponding test documents.

In the next iteration, the outputs of the previous step are taken as input to the map program. In this step, we set the key to {P, C1} (the combination of P and C1) in the mapper phase. In the reducer, each SMi is now split into subsets SMi1, SMi2, …, SMij, each of which is trained on the label C2, and second-level child labels are assigned to the corresponding test documents. This procedure is run recursively (the key becomes {P, C1, C2}, and in subsequent phases {P, C1, C2, …, Cp}), with classification performed on Cp+1 (the next child label) until the stopping criterion is reached.

Stopping Criterion: If the mapper output at any stage consists of only test documents, those documents are considered processed, since by that stage all the training class labels would have been set as keys in the penultimate mapper phase. If the number of processed documents equals the cardinality of Stest, i.e. n(Stest), then all the documents are completely classified.

Figure 2. Algorithm Workflow

III. ALGORITHM

The pseudo code for our proposed algorithm (also illustrated in Figure 2) is as follows:

Algorithm - Pseudo code

    Initial Step:
        Input:  Strain <P (or C0), C1, C2, …, Ci | Ftr1 Ftr2 Ftr3 … Ftrn>
                Stest  <null | Fts1 Fts2 Fts3 … Ftsm>
        Train(Strain, C0)
        Stest' = Classify(Stest)
        Output: Stest' <P | Fts1 Fts2 Fts3 … Ftsm>

    Map-Reduce:
        i = 0
        S = {}
        While( !StoppingCriterion ):
            Mapper():
                Mapper<Input> = {Strain, Stest'}
                S = S U {Ci}
                setKey -> S
                setValue -> {Strain, Stest'}
                Mapper<Output> = (S, Strain U Stest')
            Reducer():
                Reducer<Input> = Mapper<Output>
                Train(Strain, Ci+1)
                Stest' = Classify(Stest')
            i++
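To make one round of the above pseudocode concrete, the sketch below simulates a single map-reduce stage: the mapper keys each document by the label path assigned so far, the shuffle groups documents sharing a parent path, and the reducer assigns the next-level label to the unlabeled (test) members of each group. The toy corpus and the word-overlap stand-in classifier are our own illustration; in practice the Train/Classify steps would use SVM, Naïve Bayes, or another base classifier, as noted later in this paper.

    from collections import defaultdict

    # Toy corpus: (doc_id, labels_so_far, next_label_or_None, word_set).
    # next_label is known for training documents and None for test documents.
    docs = [
        ("d1", ["Science"], "Physics", {"atom", "energy"}),
        ("d2", ["Science"], "Biology", {"cell", "gene"}),
        ("d3", ["Science"], None,      {"atom", "spin"}),    # test document
        ("d4", ["Arts"],    "Music",   {"melody", "chord"}),
        ("d5", ["Arts"],    None,      {"melody", "tune"}),  # test document
    ]

    def mapper(doc):
        # Key = the label path assigned so far, so that documents sharing
        # the same parent labels are routed to the same reducer.
        doc_id, labels, next_label, words = doc
        return (tuple(labels), doc)

    def classify(words, train):
        # Stand-in classifier (nearest neighbour on word overlap); SVM,
        # Naive Bayes, Random Forest, etc. could be plugged in here.
        return max(train, key=lambda t: len(words & t[3]))[2]

    def reducer(key, group):
        # Train on the labelled members of this group, then assign the
        # next-level child label to its unlabelled (test) members.
        train = [d for d in group if d[2] is not None]
        for doc_id, labels, next_label, words in group:
            label = next_label if next_label is not None else classify(words, train)
            yield (doc_id, labels + [label], words)

    # One stage: map, shuffle (group by key), then reduce.
    groups = defaultdict(list)
    for doc in docs:
        key, value = mapper(doc)
        groups[key].append(value)

    for key in sorted(groups):
        for result in reducer(key, groups[key]):
            print(result)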
IV. RELATED WORK

Several earlier attempts have been made in the field of text classification [3], [4]. These approaches can be divided into two broad categories: flat classification and hierarchical classification. In flat classification, the predefined classes are treated separately and any relationship between them is ignored. In hierarchical classification, on the other hand, the hierarchical relationship between the labels (classes) is taken into account.

There are two basic hierarchical classification methods, namely the big-bang approach and the top-down level-based approach. In the big-bang approach, the classifier assigns a document to a class in one single step, whereas in the top-down level-based approach, classification is achieved with the help of classifiers built at each level of the tree. The test document starts at the root of the tree and is compared to the classes at the first level. The document is assigned to the best matching level-1 class and is then compared to all sub-classes of that class. This process continues until the document reaches a leaf, or an internal class below which it cannot be further classified.

In our approach, we adopt a variant of the top-down, level-based approach. However, unlike most existing algorithms, we employ a parallel workflow based on the map-reduce key-value concept. In our proposed method, the hierarchies are learned on the fly instead of using a pre-defined hierarchy tree. Most of the earlier works [4] do not particularly address the problem of very large datasets, which we address here.

V. CONCLUSION

This paper explores new dimensions in hierarchical classification by implementing a Map-Reduce based parallel algorithm. We divided the problem into multiple stages, and the algorithm learned the hierarchies on the way while tagging the unlabeled entities. Different classifiers, viz. SVM [2], the Naïve Bayesian classifier, Random Forest, etc., can be employed, as the cardinality of the training and test sets differs at different stages.

VI. REFERENCES

[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), San Francisco, CA, December 2004.

[2] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, 20:273-297, November 1995.

[3] S. Dumais and H. Chen, "Hierarchical Classification of Web Content," in SIGIR 2000.

[4] Canasai Kruengkrai and Chuleerat Jaruskulchai, "A Parallel Learning Algorithm for Text Classification," The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Canada, July 2002.