
Effective XML Document Classification with
Protection from Ambiguous Class Prediction
Performance Evaluation with DBLP & Wikipedia Corpus
Thasleena N.T.1, Uday Babu P.2, Varghese S.C.3
Dept. of Computer Science & Engineering
Rajagiri School of Engineering & Technology
Kochi, India
thalu555@gmail.com1, udaybabup@gmail.com2,
Abstract—XML documents are widely exploited in the
internet for efficient storage, retrieval and display of data,
which makes their classification an important task. This paper
proposes a supervised methodology for classifying a given
collection of XML documents that also eliminates ambiguous
class prediction. Rule-based classification requires an exact match
and is therefore ill-suited to variant XML documents, while other
existing methodologies for XML document classification demand
more time and memory. The proposed methodology addresses
both drawbacks. It extracts structural and content-based features
from the XML documents into transaction formats, on which the
FP-growth algorithm is executed to generate association rules.
After irrelevant rules are eliminated, the remaining rules are
provided to the classifier. The classifier uses these rules to
classify normal as well as variant XML documents and handles
ambiguous classification effectively. The methodology thus
classifies XML documents based on both their content and
their structural information.
algorithm; ambiguous class prediction; DBLP; Wikipedia
XML documents are semi-structured documents that provide a
convenient platform to capture, extract, process and present
data effectively. Data sets are therefore often modeled as XML
documents, and classifying them into predefined classes is
important for their management and processing. Classification
is relevant to managing online repositories and digital
libraries, where the volume of stored data is growing rapidly
and its management is becoming tedious. The proposed
methodology classifies XML documents based on their content
and structure while avoiding ambiguous class prediction (ACP).
XML document classification was initially performed purely on
the basis of the document's structure. Such methodologies
neglected the content of the documents and could not tolerate
even a slight variation in structure, which removed the
flexibility to construct XML documents with varying structure.
Later methodologies recognized that the content of XML
documents plays an important role in determining their class
and used it to improve classification accuracy. Hence the latest
methodologies for XML document classification consider both
structure and content information.
The DBLP data set is the basic data set available for
evaluating XML document classification methodologies, but it is
poorly suited to measuring the effectiveness of classification
techniques because it contains simple XML documents of uniform
structure. The existing methodologies have high time and space
complexity, yet they achieve high accuracy on the DBLP data set.
They fail to classify XML documents from other prominent data
sets, such as Wikipedia and IEEE, with the same accuracy; the
accuracy falls several-fold. The available methodologies also
make ambiguous class predictions on data sets other than DBLP,
i.e. the same XML document may be classified into two different
classes. An efficient methodology should classify all types of
XML documents with lower time and space requirements and must
avoid ambiguous class prediction.
The proposed methodology applies the FP-growth algorithm to
generate association rules, which reduces both time and memory
usage compared with existing techniques such as the Apriori
algorithm. It also proposes a solution to the ambiguous class
prediction problem inherent in existing XML document
classification methodologies.
The rest of the paper is organized as follows. Section I
introduced the concept and relevance of XML document
classification. Section II discusses related work. Section III
explains the proposed methodology for classifying XML
documents. Section IV presents the results obtained, and
Section V provides the conclusion and the scope for future work.
In [1], XML documents are first converted into sets of
attribute values that are determined from the structural
relationships, such as parent-child and next-sibling
relationships, that exist between the nodes of the
corresponding XML trees. The C5 algorithm is then applied to
classify the generated attribute-value sets via tree induction.
In [2], Graph Neural Networks are employed to classify XML
documents. This is a supervised machine learning method that
processes data structured as graphs without transforming it
into vector format.
In [3], XML documents are transformed into flat text
documents that capture the structural features of the XML
documents as plain text. A Naive Bayes classifier is then used
to classify the flattened documents.
In [4], feature vectors representing the structural and
content-based features of XML documents are generated. A
frequent tree mining algorithm with an information gain filter
extracts structural information from the XML documents.
Structural rules with the required support and adequate
confidence are retrieved by a rule mining algorithm. The
documents are initially classified into two sets: one
containing the left-hand side of the rule and the other
containing the right-hand side. Content information is
extracted in four main steps. First, the XML documents are
preprocessed by stemming and removing stop words. An inverted
index is then built for each XML document, and clustering is
performed on the words of the XML documents. Finally, a feature
vector is generated for every XML document. The structural and
content feature vectors are concatenated into a single vector
representing each XML document. The classifier model, built
with support vector machine (SVM) and decision tree (DT)
algorithms, is trained with the generated feature vectors.
In [5], a structured vector model is employed to represent
structural information. A Naive Bayes classifier, adapted from
the Bernoulli generative model to the structured vector model,
classifies the XML documents.
In [6], a bottom-up approach is followed. It rests on the
observation that the most discriminating information is carried
by the terms within the content. The key terms of each class
are identified among the content terms so that they can be used
for classification. A key path is a path from the root element
to a leaf, where the leaf contains at least one key term for
the class. The model for a given class comprises all the key
paths present in the XML documents belonging to that class. The
class models allow an unlabeled XML document to be classified
by means of a similarity-based scheme. Specifically, for each
class, the similarity between the key paths in the class model
and those in the structure of the previously unseen XML
document is evaluated, and the document is assigned to the
class it resembles most. The probability that an XML document
belongs to a class is computed from the sum of the similarities
between pairs of paths.
In [9], XRULES, a methodology for classifying XML documents
based on structural information, is proposed. XRULES iterates
through all frequent embedded subtrees associated with each
known class of the XML documents under consideration.
Structural rules for prediction are generated from these
subtrees. These rules determine the likelihood that an XML
document belongs to a particular class.
In [10], a methodology for classifying XML documents without
a schema is proposed, based on the construction of expressive
feature spaces. The feature space is filled with ontological,
structural and content-related information. Twigs, in the form
of triplets consisting of the labels of the left child, parent
and right child, form the structural features inserted into the
feature space. The proportion of content and structural
features is selected according to the degree of structural
diversity in the XML documents used for training. Support
Vector Machines are used to classify the XML documents using
the generated feature space.
The proposed system for the classification of XML
documents involves two main phases: a learning phase followed
by a testing phase.
A. Learning phase : Preprocessing
The XML documents in the selected data set first undergo a
preprocessing phase, which involves four steps.
• Stop words removal: Stop words (e.g. and, of) present
in the content of the XML documents are discarded.
• Tokenization: XML documents are processed to
generate tokens from the stream of words, tag names
and attribute names.
• Stemming: The remaining terms in the content are
transformed into their respective root or base form (e.g.
running to run, leaves to leaf).
• Division: The XML documents are divided on the basis of
their classes.
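The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, `stem` is a crude suffix stripper standing in for a real stemmer (it will not map leaves to leaf), and the names `preprocess` and `divide_by_class` are not from the paper.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is"}

def stem(token):
    """Crude suffix stripper, an assumed stand-in for a real stemmer."""
    if token.endswith("ing") and len(token) > 6:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text):
    """Tokenize the text, drop stop words, then stem the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def divide_by_class(labelled_docs):
    """Group preprocessed documents by class label (the Division step)."""
    groups = defaultdict(list)
    for label, text in labelled_docs:
        groups[label].append(preprocess(text))
    return dict(groups)
```

For instance, `preprocess("Classifying the XML documents of DBLP")` yields the terms classify, xml, document and dblp.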
B. Learning phase : Feature Extraction
The path from the root element to each term in the content
of an XML document becomes a unique key path whose leaf node
is the term. The key paths retrieved from the XML documents
belonging to the same class are grouped together.
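A minimal sketch of key path extraction, using Python's standard `xml.etree.ElementTree`, is given below. It assumes that a term is any whitespace-separated, lower-cased word in an element's text, ignores attributes and text that follows a child element, and uses a hypothetical function name `key_paths`.

```python
import xml.etree.ElementTree as ET

def key_paths(xml_text):
    """Return the set of root-to-term paths, one per content term."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = prefix + [elem.tag]
        # Each term in this element's own text becomes the leaf of a key path.
        for term in (elem.text or "").lower().split():
            paths.add("/".join(path + [term]))
        for child in elem:
            walk(child, path)

    walk(root, [])
    return paths
```

Applied to `<article><title>XML mining</title><year>2008</year></article>`, this produces the key paths article/title/xml, article/title/mining and article/year/2008.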
C. Learning phase : Rule Mining
The FP-growth algorithm is applied class-wise to the set of key
paths of each class to retrieve the frequent key paths [11].
These frequent patterns form the rules whose consequent is
the respective class. The rules with the required support and
confidence are saved.
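The class-wise mining step can be illustrated with a compact FP-growth sketch. It assumes that each document is a set of key paths treated as one transaction and that support is given as a fraction of the documents in a class; the names `Node`, `build_tree`, `fp_growth` and `mine_class_rules` are hypothetical, and confidence filtering is left out.

```python
import math
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, min_count):
    """Build an FP-tree from (items, count) pairs; return header table and item counts."""
    freq = defaultdict(int)
    for items, count in weighted:
        for item in set(items):
            freq[item] += count
    freq = {item: c for item, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted:
        # Insert frequent items in descending-frequency order (ties broken by name).
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, freq

def fp_growth(weighted, min_count, suffix=frozenset()):
    """Return {frequent itemset: support count}, mined from conditional trees."""
    header, freq = build_tree(weighted, min_count)
    patterns = {}
    for item, nodes in header.items():
        pattern = suffix | {item}
        patterns[pattern] = freq[item]
        # Conditional pattern base: the prefix path of every node holding `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        patterns.update(fp_growth(base, min_count, pattern))
    return patterns

def mine_class_rules(docs_by_class, min_support):
    """Run FP-growth per class; each frequent pattern is a rule 'pattern -> class'."""
    rules = {}
    for label, docs in docs_by_class.items():
        min_count = math.ceil(min_support * len(docs))
        rules[label] = fp_growth([(paths, 1) for paths in docs], min_count)
    return rules
```

Because the tree is built once per class and mined without candidate generation, no repeated scans of the key path sets are needed, which is the property the paper relies on when preferring FP-growth over Apriori.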
D. Learning phase : Pruning
Rules generated from the top p frequent patterns form the
primary set of rules, and the remaining rules form the
secondary set of rules for the corresponding class. The most
effective rules are retained by pruning away irrelevant rules
[8]. The retained rules are used by the associative
classifier for classifying XML documents.
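The split into primary and secondary rule sets can be sketched as below. The sketch assumes that "top p" means the p patterns with the highest support count, with ties broken deterministically by the sorted key paths; the pruning criterion of [8] is not reproduced, and `split_rules` is a hypothetical name.

```python
def split_rules(pattern_counts, p):
    """Split one class's patterns into primary (top p by support) and secondary."""
    ranked = sorted(pattern_counts.items(),
                    key=lambda kv: (-kv[1], sorted(kv[0])))
    primary = [pattern for pattern, _ in ranked[:p]]
    secondary = [pattern for pattern, _ in ranked[p:]]
    return primary, secondary
```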
E. Testing phase
The XML documents in the test set are classified by the
associative classifier. These documents undergo the
preprocessing and feature extraction phases, which yield a set
of key paths for each XML document.
The classifier compares the antecedents of the rules of each
class with the key paths of a given document. If an exact match
is found, the class is assigned directly; otherwise the XML
document is classified into the class with the maximum
percentage of matching. Secondary rules are used only if the
primary rules fail to classify the document. This reduces
classification time.
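The matching step can be illustrated as follows. The paper does not define the percentage of matching, so this sketch assumes it is the fraction of a rule's antecedent key paths that occur in the document, and it assumes the primary rules "fail" when none of them matches at all. The function names are hypothetical.

```python
def match_ratio(antecedent, doc_paths):
    """Fraction of the rule's key paths found in the document (1.0 = exact match)."""
    return len(antecedent & doc_paths) / len(antecedent)

def score_classes(rules_by_class, doc_paths):
    """Best match ratio per class over that class's rules."""
    return {label: max((match_ratio(r, doc_paths) for r in rules), default=0.0)
            for label, rules in rules_by_class.items()}

def classify(primary, secondary, doc_paths):
    """Try primary rules first; fall back to secondary rules only if nothing matches.

    Returns (winning classes, score). More than one winner signals an
    ambiguous class prediction that needs further resolution.
    """
    for rules_by_class in (primary, secondary):
        scores = score_classes(rules_by_class, doc_paths)
        best = max(scores.values(), default=0.0)
        if best > 0.0:
            winners = sorted(l for l, s in scores.items() if s == best)
            return winners, best
    return [], 0.0
```

Returning every class that shares the top score, rather than picking one arbitrarily, keeps ambiguous predictions visible to the caller.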
If an ambiguity arises in class prediction, terms related
to the leaf node terms are generated based on semantics. These
newly generated terms are used to generate new key paths
for the XML document to be classified, and the process
continues. The key paths are now matched only against the
rules of the two ambiguous classes, and the decision is taken
accordingly. For example, XML documents of the portals
Spirituality and Christianity in Wikipedia fail to get
classified into a unique class because the key paths of these
portals are similar, which leads to ambiguous class prediction.
To deal with such problems in classification based on key
paths, the terms in the XML document are used to arrive at a
decision. During the training phase, the XML documents
belonging to each portal are used to capture the characteristic
words of that portal, i.e. the words with discriminating power.
The terms extracted from the XML document to be classified are
matched with the terms captured from the XML documents of the
different portals under consideration. The XML document is
then classified into the class/portal whose words or newly
generated key paths match to the highest degree.
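The term-based tie-break can be sketched as below. It assumes that the characteristic words of each class are already available as sets and that the degree of matching is the share of the document's terms found among a class's characteristic words; the generation of semantically related terms is not modelled. `resolve_ambiguity` and the example word sets are hypothetical.

```python
def resolve_ambiguity(tied_classes, class_terms, doc_terms):
    """Pick, among the tied classes only, the one whose characteristic terms
    overlap most with the document's terms."""
    overlap = {
        label: len(class_terms[label] & doc_terms) / max(len(doc_terms), 1)
        for label in tied_classes
    }
    # Iterating over sorted labels makes a remaining tie resolve deterministically.
    winner = max(sorted(overlap), key=overlap.get)
    return winner, overlap

# Hypothetical characteristic words for the two portals mentioned above.
class_terms = {
    "Christianity": {"church", "bible", "gospel"},
    "Spirituality": {"meditation", "soul", "yoga"},
}
doc_terms = {"church", "gospel", "soul", "history"}
winner, overlap = resolve_ambiguity(["Christianity", "Spirituality"],
                                    class_terms, doc_terms)
```

Here two of the four document terms match Christianity and one matches Spirituality, so the document is assigned to Christianity.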
The performance of the proposed methodology for XML
document classification is evaluated on the DBLP and
Wikipedia data sets. The methodology classifies the XML
documents in the DBLP data set with 100% accuracy, without
any ambiguous class prediction, in a total runtime of
2500 seconds. On the Wikipedia data set it achieves 81.34%
accuracy and avoids ambiguous classification completely, with
a total runtime of 3546 seconds.
The performance of an XML document classifier is expressed
with average precision (P), average recall (R) and average F-measure (F) [8]. CBA, CPAR and XCSS are other existing
Fig. 1. Proposed System : Learning Phase
ACP is ambiguous class prediction
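The averaged measures P, R and F can be computed as in the sketch below. It assumes macro-averaging, i.e. precision, recall and F-measure are computed per class and then averaged over the classes; the exact averaging used in [8] may differ, and `macro_prf` is a hypothetical name.

```python
def macro_prf(true_labels, predicted_labels):
    """Return (P, R, F) averaged over all classes seen in either label list."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    precisions, recalls, f_measures = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_measures.append(f)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n
```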
The proposed methodology is shown to classify XML documents
with high accuracy. It resolves the problem of ambiguous class
prediction that prevailed in existing classification
methodologies. The methodology performs its task with reduced
time and space requirements by adopting the FP-tree method. It
still requires improvement in terms of P, R and F. As future
work, the leaf nodes representing the key terms of the XML
documents can be mapped to their synonyms using a corpus.
[1] Candillier, L., Tellier, I., and Torre, F., “Transforming XML trees for
efficient classification and clustering,” Advances in XML Information
Retrieval and Evaluation, Proceedings of the 4th International Workshop
of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M.
Lalmas, S. Malik, and G. Kazai Eds., Springer, 2008, pp. 469–480.
[2] Candillier L., Tellier I., and Torre F., “Xml document mining using
graph neural network,” In Comparative Evaluation of XML Information
Retrieval Systems, Proceedings of the 5th International Workshop of the
Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and
A. Trotman Eds., Springer, 2007, pp. 458–472.
[3] de Campos L., Fernández-Luna J., Huete J., and Romero A.,
“Probabilistic methods for structured document classification at
INEX’07,” In Focused Access to XML Documents, Proceedings of the 6th
International Workshop of the Initiative for the Evaluation of XML
Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds.,
Springer, 2008, pp. 195–206.
[4] Mohammad Khabbaz, Keivan Kianmehr, and Reda Alhajj, “Employing
Structural and Textual Feature Extraction for Semistructured Document
Classification,” IEEE Transactions on Systems, Man, and Cybernetics,
Part C: Applications and Reviews, vol. 42, no. 6, November 2012.
[5] Yi J. and Sundaresan N., “A classifier for semi-structured documents,” In
Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. 2000, pp. 340-344.
[6] Wu J. and Tang, J., “A bottom-up approach for xml documents
classification,” In Proceedings of the International Symposium on
Database Engineering and Applications. 2008, pp. 131-137.
[7] Denoyer, L., Gallinari, P.: Report on the XML Mining Track at Inex
2007. ACM SIGIR Forum 42(1), 22–28 (2008).
[8] Gianni Costa, Riccardo Ortale, and Ettore Ritacco "Learning Effective
XML Classifiers Based on Discriminatory Structures and Nested
Content," XML Classification Based on Discriminatory Structures and
Nested Content, pp. 156-171.
[9] Zaki, M. and Aggarwal, C. 2006. Xrules: An effective algorithm for
structural classification of xml data. Mach. Learn. 62, 1–2, 137–170
[10] Theobald, M., Schenkel, R., and Weikum, G. 2003. Exploiting structure,
annotation, and ontological knowledge for automatic classification of
XML data. In Proceedings of the WebDB Workshop. 1–6.
[11] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and