A Map Reduce Based Parallel Algorithm for Large Scale Hierarchical Classification

Akshat Verma (Author)
B.E.(Hons) Computer Science, Birla Institute of Technology and Science (BITS), Pilani, Goa Campus
Associate Consultant, EMC Corporation, Bangalore, India

Sabyasachi Upadhyay (Author)
B.E.(Hons) Computer Science, Birla Institute of Technology and Science (BITS), Pilani, Pilani Campus
Associate Consultant, EMC Corporation, Bangalore, India
Abstract
Hierarchical classification is a method of grouping in which input
data is classified into categories arranged from general to
specific: the structure is initially arranged in broad groups that
are then successively subdivided into narrower groups, and the
final output is the overall hierarchical classification of the
data.
Hierarchical classifiers find application in many varied areas,
from computer vision and handwritten text recognition to document
organization. Of these applications, hierarchical classification
has become especially popular in recent times for the organization
of text documents, particularly on the Web. Web directories and
Wikipedia are two widely used examples of such hierarchies. As the
size of the hierarchy grows and the number of documents to be
classified increases, the task becomes a large scale data
processing problem. This paper implements hierarchical
classification using the key-value features of Map-Reduce, with
classifiers suited to the problem at hand.
Keywords: Hierarchical, Map-Reduce, Classification,
Parallelization, Top Down
I. INTRODUCTION
In this paper, we present an algorithm for large scale
hierarchical classification based on the Map-Reduce parallel
programming model. To keep the presentation compact and easy to
follow, we use text classification as the running example, though
the algorithm can be applied to any problem involving hierarchical
classification.

The hierarchical text classification problem consists of a
population Sk of k documents to which hierarchical classification
labels are to be assigned. Each document di is represented as a
feature vector Fi, where 1 <= i <= k. Each feature vector Fi is
made up of a collection of word nodes along with their term
frequencies in the document.

Here, N-fold cross-validation is employed: the entire corpus Sk is
divided into N sets, one of which is taken as the hold-out data
containing the unlabeled documents, while the remaining N-1 sets
are assigned hierarchical class labels and constitute the training
set. The hierarchical class labels are assigned with the parent
node first, followed by its child nodes, and so on.

Representation:
Doc ID, C0 (or P), C1, C2, ..., Cn, a:b c:d e:f g:h
C0 or P          -> parent label
C1, C2, ..., Cn  -> child labels
a:b c:d e:f g:h  -> word nodes, colon-separated from their term frequencies

The test set contains a set of unclassified documents whose
classes are to be identified in the hierarchical format.
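As an illustration, a record in the format above can be parsed as
follows. This is a minimal sketch under the assumption that fields
are comma-separated, the label path precedes the feature list, and
word:frequency pairs are space-separated; the paper does not fix an
exact on-disk encoding.

```python
# Minimal sketch of a parser for the record format described above.
# Assumptions (not fixed by the paper): fields are comma-separated,
# labels come first, and the last field holds space-separated
# "word:frequency" pairs.
def parse_record(line):
    """Parse 'DocID, P, C1, ..., Cn, a:b c:d' into its components."""
    fields = [f.strip() for f in line.split(",")]
    doc_id = fields[0]
    labels = fields[1:-1]            # [P, C1, ..., Cn]; empty for test docs
    features = {}
    for pair in fields[-1].split():
        word, freq = pair.split(":")
        features[word] = int(freq)   # term frequency of each word node
    return doc_id, labels, features

doc_id, labels, features = parse_record("d42, P, C1, apple:3 banana:1")
print(doc_id, labels, features)  # d42 ['P', 'C1'] {'apple': 3, 'banana': 1}
```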
II. METHODOLOGY

A. Map Reduce Model
Map-Reduce [1] is a programming model for large scale parallel
data processing that uses a distributed algorithm running on a
cluster of machines. The Map-Reduce programming model has two
components: Map and Reduce. Map is a function applied to a set of
input values that produces a set of key-value pairs. Reduce is a
function that takes these results and applies another function to
the output of the map function. A reducer receives all the data
for an individual key from all the mappers (shuffling), and due to
the shuffling property, the keys are in sorted order when they are
collected by the reducers (sorting). Thus, Map transforms a set of
data into key-value pairs, and Reduce aggregates this data and
outputs the result of a function applied to the aggregated data
(grouping). This concept is illustrated in Figure 1.

Figure 1. Map-Reduce Workflow
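To make the map/shuffle/reduce flow concrete, the following is a
minimal in-memory simulation of the three phases on a word-count
task; it is an illustrative sketch, not a distributed
implementation.

```python
# Minimal in-memory simulation of the Map-Reduce flow described
# above (illustrative only; not a distributed implementation).
from collections import defaultdict

def mapper(document):
    # Map: transform input into (key, value) pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values for each key; keys end up sorted.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce: apply an aggregation function to each key's values.
    return key, sum(values)

docs = ["map reduce map", "reduce cluster"]
pairs = [kv for d in docs for kv in mapper(d)]
print([reducer(k, vs) for k, vs in shuffle(pairs)])
# [('cluster', 1), ('map', 2), ('reduce', 2)]
```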
B. Proposed Implementation
The proposed methodology applies the Map-Reduce paradigm to the
hierarchical classification problem by breaking it into a
multi-stage, multi-label classification structure.
Initially, our training documents are labelled as C0 (or P), C1,
C2, C3, ..., whereas test documents are unlabeled. The first step
involves training the classifier C on the parent label P, taking
into account the complete training set Strain ⊂ Sk, and
classifying the test set Stest, such that Strain ∪ Stest = Sk,
thereby giving a parent label to each component of the test corpus
Stest.

After the parent labels are assigned to Stest, both Stest and
Strain are taken as input to the map-reduce program, where the
parent labels Pi (or their respective C0's) of both the training
and the test set are set as keys in the Mapper phase. As a result,
the documents with parent label Pi go to the same Reducer Ri.
Thus, we have multiple subsets of the population, SM1, SM2, ...,
where the ith subset SMi consists of training as well as test
documents. The training component of each individual subset SMi is
trained on the label C1, and first-level child labels are assigned
to the corresponding test documents.

In the next iteration, the resulting outputs of the previous step
are taken as input to the map program. In this step, we set the
key as {P, C1} (a combination of P and C1) in the mapper phase. In
the reducer, each SMi is now split into subsets SMi1, SMi2, ...,
SMij. Each of these subsets is trained on the label C2, and the
second-level child labels are assigned to the corresponding test
documents.

The aforementioned procedure is run recursively (in this instance
the key is {P, C1, C2}, and in subsequent phases {P, C1, C2, ...,
Cp}, and so on), with classification performed on Cp+1 (the next
child label) until the stopping criterion is reached.

Stopping Criterion: if the mapper output at any stage consists of
only test documents, those documents are considered processed,
since at that phase all the training class labels would already
have been set as keys in the penultimate mapper phase. If the
number of processed documents equals the cardinality of Stest,
i.e., n(Stest), then all the documents are completely classified.
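As an illustration of the mapper phase just described, the sketch
below shows how keying on the label path routes documents with the
same parent label to the same reducer; the (labels, doc_id) record
layout is an assumption made for illustration.

```python
# Sketch of the mapper keying step: documents sharing the same
# label path (here, the parent label P) are grouped for the same
# reducer. The (labels, doc_id) layout is an illustrative assumption.
from collections import defaultdict

def mapper(docs, depth):
    """Emit (label-path-prefix, doc) pairs; the shuffle groups them."""
    groups = defaultdict(list)
    for labels, doc_id in docs:
        groups[tuple(labels[:depth])].append(doc_id)
    return groups

docs = [(["P1", "C1"], "d1"), (["P1", "C2"], "d2"), (["P2"], "d3")]
print(dict(mapper(docs, 1)))
# {('P1',): ['d1', 'd2'], ('P2',): ['d3']}  -> the SM1, SM2 subsets
```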
Figure 2. Algorithm Workflow
III. ALGORITHM
The pseudo code for our proposed algorithm (also illustrated in
Figure 2) is presented as follows:
Algorithm - Pseudo code

Initial Step:
    Input:
        Strain -> < P (or C0), C1, C2, ..., Ci | Ftr1, Ftr2, Ftr3, ..., Ftrn >
        Stest  -> < null | Fts1, Fts2, Fts3, ..., Ftsm >
    Train(Strain, C0)
    Stest' = Classify(Stest)
    Output:
        Stest' -> < P | Fts1, Fts2, Fts3, ..., Ftsm >

Map-Reduce:
    i = 0
    S = {}
    While( !StoppingCriterion ):
        Mapper():
            Mapper<Input> = {Strain, Stest'}
            S = S U {Ci}
            setKey -> S
            setValue -> {Strain, Stest'}
            Mapper<Output> = (S, Strain U Stest')
        Reducer():
            Reducer<Input> = Mapper<Output>
            Train(Strain, Ci)
            Stest' = Classify(Stest')
        i++
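The loop above can be exercised end to end with the following
runnable sketch, which simulates the map-reduce rounds in memory.
The majority-label "classifier" stands in for a real one (e.g.,
SVM or Naïve Bayes), and all helper names are illustrative
assumptions, not the paper's code.

```python
# Runnable end-to-end sketch of the driver loop, simulating the
# map-reduce rounds in memory. The majority-label "classifier" is a
# toy stand-in for a real one (e.g., SVM or Naive Bayes).
from collections import Counter, defaultdict

def train(train_docs, depth):
    # Toy stand-in for Train(): most frequent level-`depth` label.
    return Counter(d["labels"][depth] for d in train_docs).most_common(1)[0][0]

def classify(model, doc):
    return model  # a real classifier would inspect doc["features"]

def run(docs):
    depth = 0
    while True:
        # Mapper: key = label path assigned so far {P, C1, ..., Ci}.
        groups = defaultdict(list)
        for d in docs:
            groups[tuple(d["labels"][:depth])].append(d)
        progressed = False
        # Reducer: per key, train on the next level and tag test docs.
        for subset in groups.values():
            train_docs = [d for d in subset if len(d["labels"]) > depth]
            test_docs = [d for d in subset
                         if not d["train"] and len(d["labels"]) == depth]
            if train_docs and test_docs:
                model = train(train_docs, depth)
                for d in test_docs:
                    d["labels"].append(classify(model, d))
                progressed = True
        if not progressed:  # stopping criterion: no test doc advanced
            break
        depth += 1
    return docs

docs = [
    {"train": True,  "labels": ["P", "C1"], "features": {}},
    {"train": True,  "labels": ["P", "C2"], "features": {}},
    {"train": False, "labels": [],          "features": {}},
]
print(run(docs)[-1]["labels"])  # e.g. ['P', 'C1']
```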
IV. RELATED WORK
Several earlier attempts have been made in the field of text
classification [3], [4]. These approaches can be divided into two
broad categories: flat classification and hierarchical
classification. In flat classification, predefined classes are
treated separately and any relationship between the classes is
ignored. In hierarchical classification, on the other hand, the
hierarchical relationship between the labels (classes) is taken
into account.

There are two basic hierarchical classification methods, namely
the big-bang approach and the top-down level-based approach. In
the big-bang approach, the classifier assigns a document to a
class in one single step, whereas in the top-down level-based
approach, classification is achieved with the help of classifiers
built at each level of the tree. The test document starts at the
root of the tree and is compared to the classes at the first
level. The document is assigned to the best-matching level-1 class
and is then compared to all sub-classes of that class. This
process continues until the document reaches a leaf or an internal
class below which it cannot be further classified.

In our approach, we adopt a variant of the top-down, level-based
approach. However, unlike most existing algorithms, we employ a
parallel workflow based on the map-reduce key-value concept. In
our proposed method, the hierarchies are learned on the fly
instead of using a pre-defined hierarchy tree. Most earlier
works [4] do not specifically address the problem of very large
datasets, which we address here.
V. CONCLUSION
This paper explores new dimensions in hierarchical classification
by implementing a Map-Reduce based parallel algorithm. We divided
the problem into multiple stages; the algorithm learned the
hierarchies along the way and tagged the unlabeled entities.
Different classifiers, e.g., SVM [2], the Naïve Bayes classifier,
or Random Forests, can be employed at different stages, since the
cardinality of the training and test sets differs from stage to
stage.
VI. REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data
    Processing on Large Clusters," in Proceedings of the 6th
    Symposium on Operating Systems Design & Implementation (OSDI),
    San Francisco, CA, December 2004.
[2] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine
    Learning, 20:273-297, 1995.
[3] S. Dumais and H. Chen, "Hierarchical Classification of Web
    Content," in SIGIR 2000.
[4] Canasai Kruengkrai and Chuleerat Jaruskulchai, "A Parallel
    Learning Algorithm for Text Classification," in The Eighth ACM
    SIGKDD International Conference on Knowledge Discovery and Data
    Mining (KDD-2002), Canada, July 2002.