Effective XML Document Classification with Protection from Ambiguous Class Prediction: Performance Evaluation with DBLP & Wikipedia Corpus

Thasleena N.T.1, Uday Babu P.2, Varghese S.C.3
Dept. of Computer Science & Engineering, Rajagiri School of Engineering & Technology, Kochi, India
thalu555@gmail.com1, udaybabup@gmail.com2, varghesesc@rajagiritech.ac.in3

Abstract—XML documents are widely used on the internet for efficient storage, retrieval and display of data, and their classification is a significant task in the current scenario. A novel methodology based on supervised classification is proposed to classify a given collection of XML documents, incorporating the ability to eliminate ambiguous class prediction. Rule-based classification requires an exact match and is therefore not an apt methodology for classifying variant XML documents, while other existing methodologies for XML document classification have high time and memory requirements. The proposed methodology overcomes these drawbacks. It extracts structural and content-based features from the XML documents into transaction format, on which the FP-growth algorithm is executed to generate association rules. The generated association rules are provided to the classifier after eliminating irrelevant rules. The classifier is trained to classify normal as well as variant XML documents using the extracted association rules, with effective handling of ambiguous classification. The proposed methodology thus accomplishes XML document classification based on the document's content and structural information.

Keywords—XML document classification; FP-growth algorithm; ambiguous class prediction; DBLP; Wikipedia

I. INTRODUCTION

XML documents are semi-structured documents which form the most ideal platform available to capture, extract, process and present data effectively and easily.
Hence data sets are modeled as XML documents, and their classification into predefined classes is of extreme significance for their efficient management and processing. Classification is relevant for managing online repositories and digital libraries in the current scenario, where data storage is expanding at an alarming rate and its management is becoming highly tedious. The proposed methodology accomplishes XML document classification relying on the document's content and structure while avoiding ambiguous class prediction (ACP).

XML document classification was initially performed purely on the basis of the document's structure. Such classification methodologies, which neglected the content of the documents, could not tolerate even a slight variation in structure. These traditional techniques revoked the ability to construct XML documents with structural flexibility. Current methodologies recognize the relevance of the content of XML documents for improving classification accuracy; content plays an important role in determining the class of an XML document. Hence the latest methodologies for XML document classification consider both structure and content information to accomplish their task.

The DBLP data set is the fundamental data set available for evaluating XML document classification methodologies, but this simple data set is highly inadequate for measuring the efficiency of classification techniques, as it contains simple XML documents with uniform structures. Existing methodologies have high time and space complexity, yet they express high accuracy on the DBLP data set. These techniques fail to classify XML documents in other prominent data sets such as Wikipedia and IEEE with the same accuracy; the accuracy falls by several folds for these data sets. The available methodologies also make ambiguous class predictions, i.e.
the same XML document may get classified into two different classes for data sets other than DBLP. An efficient methodology should be able to classify all types of XML documents with low time and space requirements and must avoid ambiguous class prediction. The proposed methodology applies the FP-growth algorithm for generating association rules, which reduces not only the time but also the memory usage compared to existing techniques such as the Apriori algorithm. The proposed methodology also offers a solution to the ambiguous class prediction problem inherent in all existing XML document classification methodologies.

The paper is organized as follows. Section I gave an introduction to the concept and relevance of XML document classification. Section II discusses related works. Section III explains the proposed methodology for classifying XML documents. Section IV presents the results obtained, and Section V provides the conclusion and the scope for future work.

II. RELATED WORKS

In [1] XML documents are first converted into sets of appropriate attribute values, determined by utilizing the structural relationships, such as parent-child and next-sibling relationships, that exist between the nodes of the corresponding XML trees. The C5 algorithm is applied to classify the generated attribute-value sets via tree induction. In [2] Graph Neural Networks are employed to classify XML documents; this is a supervised machine learning method that processes data structured as graphs without transforming them into vector format. In [3] XML documents are transformed into flat text documents that capture the structural features of the XML documents as plain text, and a Naive Bayes classifier is employed to classify the flattened documents. In [4] feature vectors representing structural and content-based features of XML documents are generated.
A frequent tree mining algorithm with an information gain filter extracts structural information from the XML documents. Structural rules of required support and adequate confidence are retrieved by a rule mining algorithm. The documents are initially classified into two sets: one that contains the left-hand side of a rule and one that contains the right-hand side. Content information is extracted through four main steps. First, the XML documents are preprocessed by stemming and removing stop words. An inverted index is then built for each XML document, and clustering is performed on the words of the XML documents. Finally, a feature vector is generated for every XML document. The retrieved feature vectors are concatenated into a single vector representing each XML document. A classifier model built with support vector machine (SVM) and decision-tree (DT) algorithms is trained with the generated feature vectors.

In [5] a structured vector model is employed to represent structural information. A Naive Bayes classifier, based on an adaptation of the Bernoulli generative model to the structured vector model, classifies the XML documents. In [6] a bottom-up approach is executed, based on the observation that the most discriminating information is carried by the terms within the content. Key terms of each class are identified from the content so that they can be used for classification. A key path is a path from the root element to a leaf, where the leaf contains at least one key term for the class. A model for a given class comprises all the key paths present in the XML documents belonging to that class. The class models allow classification of an unlabeled XML document by means of a similarity-based scheme: the document is classified into the class with which it has the highest similarity.
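To make the key-path idea concrete (it is shared by [6] and by the proposed methodology in Section III), a root-to-term path extractor can be sketched with Python's standard library. This is a simplified illustration under our own naming: attribute names are omitted, and `key_paths` is not a function from any cited work.

```python
import xml.etree.ElementTree as ET

def key_paths(xml_text):
    """Return the set of key paths: one root-to-leaf tag path
    per content term, with the term itself as the leaf."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        tag_path = prefix + (elem.tag,)
        if elem.text and elem.text.strip():
            # each whitespace-separated term yields its own key path
            for term in elem.text.split():
                paths.add(tag_path + (term.lower(),))
        for child in elem:
            walk(child, tag_path)

    walk(root, ())
    return paths

doc = "<article><title>XML classification</title><year>2012</year></article>"
for p in sorted(key_paths(doc)):
    print("/".join(p))
# prints:
# article/title/classification
# article/title/xml
# article/year/2012
```

In a full pipeline, stop word removal and stemming would be applied to each term before the path is recorded.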
Specifically, for each class, the similarity between the key paths in the corresponding class model and those in the structure of the previously unobserved XML document is evaluated, and the document is classified into the most resembling class. The probability of an XML document belonging to a class is computed from the sum of the path similarities.

In [9] a novel methodology, XRULES, for the classification of XML documents based on structural information is proposed. XRULES iterates through all frequent embedded subtrees associated with each known class of the XML documents under consideration. Structural rules for prediction are generated from these subtrees; the rules determine the likelihood of an XML document belonging to a particular class. In [10] a methodology for classifying XML documents without a schema is proposed, based on the construction of expressive feature spaces filled with ontological, structural and content-related information. Twigs, in the form of triplets consisting of the labels of the left child, parent, and right child, form the structural features inserted into the feature space. The proportion of content and structural features is selected according to the degree of structural diversity in the XML documents used for training. Support Vector Machines are employed to classify XML documents using the generated feature space.

III. PROPOSED METHODOLOGY

The proposed system for the classification of XML documents involves two main phases: the learning phase followed by the testing phase.

A. Learning phase: Preprocessing

The XML documents in the selected data set initially undergo a preprocessing phase consisting of four steps.
Stop word removal: Stop words (e.g. and, of) present in the content of the XML documents are discarded.
Tokenization: XML documents are processed to generate tokens from the stream of words, tag names and attribute names.
Stemming: The remaining terms in the content are transformed into their respective root or base form (e.g. running to run, leaves to leaf).
Division: The XML documents are divided on the basis of their classes.

B. Learning phase: Feature Extraction

The path from the root element to each term in the content of an XML document becomes a unique key path, where the leaf node is the term. The sets of key paths retrieved from the XML documents belonging to the same class are clubbed together into one group.

C. Learning phase: Rule Mining

The FP-growth algorithm is applied class-wise to the set of key paths of each class to retrieve the frequent key paths [11]. These frequent patterns form the rules, whose consequent is the respective class. The rules of required support and confidence are saved.

D. Learning phase: Pruning

Rules generated from the top p frequent patterns form the primary set of rules for the corresponding class, and the remaining rules form the secondary set. The most efficient rules are retained by pruning away irrelevant rules [8]. The retained rules are exploited by the associative classifier for classifying XML documents.

E. Testing phase

The XML documents in the test set are classified by the associative classifier. These documents undergo the preprocessing and feature extraction phases, yielding a set of key paths for each XML document. The classifier compares the antecedents of the rules of each class with the key paths of a given document. If an exact match is found the class is obvious; otherwise the XML document is classified into the class with the maximum percentage of matching. Secondary rules are exploited only if the primary rules fail to perform the classification, which reduces time usage. If an ambiguity arises in class prediction, then terms related to the leaf node term are generated based on semantics.
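The matching step of the testing phase can be sketched as follows. This is a minimal sketch under assumed data structures (rules stored per class as lists of antecedent key-path sets); all names and the sample rule sets are illustrative, not taken from the paper.

```python
def match_score(rules, doc_paths):
    """Fraction of a class's rule antecedents fully matched by the document's key paths."""
    if not rules:
        return 0.0
    return sum(1 for antecedent in rules if antecedent <= doc_paths) / len(rules)

def classify(primary, secondary, doc_paths):
    """Score with primary rules first; fall back to secondary rules only when
    no primary rule matches. Returns (best_class, ambiguous_pair); assumes at
    least two candidate classes."""
    scores = {c: match_score(r, doc_paths) for c, r in primary.items()}
    if all(s == 0.0 for s in scores.values()):
        scores = {c: match_score(r, doc_paths) for c, r in secondary.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (c1, s1), (c2, s2) = ranked[0], ranked[1]
    if s1 == s2:
        # ambiguous class prediction: defer to semantic term generation
        return None, (c1, c2)
    return c1, None

# Illustrative rule sets (hypothetical key paths, not from the actual corpora)
primary = {
    "Spirituality": [frozenset({("portal", "faith")})],
    "Christianity": [frozenset({("portal", "church")})],
}
secondary = {"Spirituality": [], "Christianity": []}

doc_paths = {("portal", "church"), ("portal", "history")}
print(classify(primary, secondary, doc_paths))  # ('Christianity', None)
```

When the top two class scores tie, the semantic term generation just described is invoked, and matching is repeated against the two ambiguous classes only.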
These newly generated terms are used to generate new key paths for the XML document to be classified, and the process continues. The new key paths are now matched only against the rules of the two ambiguously predicted classes, and the decision is taken accordingly. For example, XML documents of the portals Spirituality and Christianity in Wikipedia fail to be classified into a unique class because of the similarity of the key paths of these portals, which leads to ambiguous class prediction. Hence, to deal with such problems of classification based on key paths, the terms in the XML document are exploited to arrive at a decision. During the training phase, the XML documents belonging to each portal are used to capture the characteristic words of that portal that have discriminating power. Terms extracted from the XML document to be classified are matched against the terms captured from the XML documents of the different portals. The XML document is classified into a class/portal based on the degree to which the words or newly generated key paths match.

IV. RESULTS

The performance of the proposed methodology for XML document classification is established using the DBLP and Wikipedia data sets. The proposed methodology classifies the XML documents in the DBLP data set with 100% accuracy and without any ambiguous class prediction, with a total runtime of 2500 seconds. On the Wikipedia data set the methodology shows 81.34% accuracy and completely avoids ambiguous classification, with a total runtime of 3546 seconds. The potential of an XML document classifier is expressed with average precision (P), average recall (R) and average F-measure (F) [8]. CBA, CPAR and XCSS are other existing classifiers.

TABLE I. PERFORMANCE COMPARISON: WIKIPEDIA

Model            P     R     F     ACP(a)
Proposed System  0.81  0.80  0.79  NO
XCSS             0.77  0.78  0.78  YES
CBA              0.60  0.61  0.61  YES
CPAR             0.73  0.72  0.73  YES
(a) ACP is ambiguous class prediction

Fig. 1. Proposed System: Learning Phase

V. CONCLUSION

The proposed methodology is shown to possess the potential to classify XML documents with high accuracy. It resolves the problem of ambiguous class prediction that prevailed in all existing classification methodologies, and it performs its task with reduced time and space requirements through the adoption of the FP-tree method. The methodology still requires enhancement in terms of P, R and F. As future work, the leaf nodes representing the key terms of the XML documents can be mapped to their synonyms using a corpus.

REFERENCES

[1] Candillier, L., Tellier, I., and Torre, F., "Transforming XML trees for efficient classification and clustering," in Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai, Eds., Springer, 2008, pp. 469–480.
[2] Candillier, L., Tellier, I., and Torre, F., "XML document mining using graph neural network," in Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman, Eds., Springer, 2007, pp. 458–472.
[3] de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A., "Probabilistic methods for structured document classification at INEX'07," in Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman, Eds., Springer, 2008, pp. 195–206.
[4] Mohammad Khabbaz, Keivan Kianmehr, and Reda Alhajj, "Employing structural and textual feature extraction for semistructured document classification," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, November 2012.
[5] Yi, J. and Sundaresan, N.,
"A classifier for semi-structured documents," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 340–344.
[6] Wu, J. and Tang, J., "A bottom-up approach for XML documents classification," in Proceedings of the International Symposium on Database Engineering and Applications, 2008, pp. 131–137.
[7] Denoyer, L. and Gallinari, P., "Report on the XML mining track at INEX 2007," ACM SIGIR Forum, vol. 42, no. 1, pp. 22–28, 2008.
[8] Gianni Costa, Riccardo Ortale, and Ettore Ritacco, "Learning effective XML classifiers based on discriminatory structures and nested content," pp. 156–171.
[9] Zaki, M. and Aggarwal, C., "XRules: An effective algorithm for structural classification of XML data," Machine Learning, vol. 62, no. 1–2, pp. 137–170, 2006.
[10] Theobald, M., Schenkel, R., and Weikum, G., "Exploiting structure, annotation, and ontological knowledge for automatic classification of XML data," in Proceedings of the WebDB Workshop, 2003, pp. 1–6.
[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, 2001.