CMSC 691 Project Report
Preliminary Report, 09-May-12
Purusharth Prakash and Mohid Farazi

1. Introduction

Typically, classification has been studied as a binary classification problem in which each data instance (tuple) can have only one of two class labels. These studies led to the development of several standardized approaches, such as Naïve Bayes, decision trees, neural networks, SVM, kNN, and ensemble methods such as AdaBoost (Tan, Steinbach et al. 2006). Since real-world problems can be more complex, extensions have been developed to allow (i) multi-class classification, (ii) collaborative filtering, and (iii) multi-label classification. The Amazon dataset (Weifeng Liu 2012) studied in this project is an example of a multi-label classification problem.

Multi-label classification in relational data is becoming an increasingly relevant problem, as most databases are relational and, as their size increases, drawing inferences becomes increasingly difficult. Multi-label classification has a wide variety of real-world applications, e.g. hyperlinked document classification, social network analysis and collaboration network analysis. Much of the effort in multi-label classification has been directed at text classification, which has been studied extensively (Tsoumakas and Katakis 2007).

In the remainder of this document, we present the algorithms, techniques and evaluation of the multi-label classification problem on the Amazon data set (Weifeng Liu 2012).

2. Problem Description

Amazon has made available historical data on the types of resources accessed by a group of employees over a period of one year as part of their MLSP Competition (Weifeng Liu 2012). The primary goal of the project is to build a classification model that uses the employee profile characteristics as features to predict the set of resources to which an employee would require access. After processing and filtering, the training data has the following form:

Set of class labels: $\mathcal{L} = \{\lambda_1, \lambda_2, \ldots, \lambda_q\}$
Training data set: $D = \{(x_1, L_1), (x_2, L_2), \ldots, (x_n, L_n)\}$

where $x_i$ is the $i$-th instance of the feature vector, $x_i \in X$ and $L_i \subseteq \mathcal{L}$, i.e. each training tuple $(x, L) \in X \times 2^{\mathcal{L}}$.

The goal is to find a classification scheme $h : X \to 2^{\mathcal{L}}$ which minimizes some error metric. In the present study the Amazon data set consists of 8,116,042 instances/tuples, and the label set consists of 14,999 unique labels.
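To make the formalization concrete, the following is a minimal sketch (not part of the report's implementation) of how training tuples $(x, L)$ can be represented in Python. The attribute and resource values are hypothetical, and scikit-learn's MultiLabelBinarizer is used only to illustrate mapping each label set $L_i \subseteq \mathcal{L}$ to a 0/1 indicator vector over $\mathcal{L}$.

```python
# Hypothetical toy tuples (x_i, L_i); real instances use the 8 categorical
# employee-profile attributes and resource-id labels described later in this report.
from sklearn.preprocessing import MultiLabelBinarizer

D = [
    ({"ROLE_DEPTNAME": 117961, "ROLE_TITLE": 118300}, {"res_4675", "res_79092"}),
    ({"ROLE_DEPTNAME": 118052, "ROLE_TITLE": 118321}, {"res_4675"}),
]

X = [x for x, _ in D]                                        # feature vectors
Y = MultiLabelBinarizer().fit_transform([L for _, L in D])   # label sets -> 0/1 rows
print(Y)  # one row per instance, one column per label observed in the data
```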
3. Background and Related Work

Multi-label classification has been studied extensively in the literature, mostly in the context of text classification. One of the first important algorithms to come out of this context is AdaBoost.MH (Schapire and Singer 2000), an extension of the well-known AdaBoost algorithm, which was implemented in the BoosTexter text classification system (Schapire and Singer 2000). Multi-label classification has received a great deal of attention in machine learning in recent years, and a number of methods have been developed, often motivated by specific types of applications such as text categorization (Schapire and Singer 2000; Ueda and Saito 2002; Kazawa, Izumitani et al. 2005; Zhang and Zhou 2006), computer vision (Boutell, Luo et al. 2004), and bioinformatics (Clare and King 2001; Elisseeff and Weston 2001; Zhang and Zhou 2006). In addition, several well-established methods for conventional classification have been extended to the multi-label case, including support vector machines (Elisseeff and Weston 2001; Boutell, Luo et al. 2004; Godbole and Sarawagi 2004), neural networks (Zhang and Zhou 2006), and decision trees (Vens, Struyf et al. 2008). Methods developed to address multi-label classification problems can be divided into two groups: (a) problem transformation methods, and (b) algorithm adaptation methods.

3.1 Problem Transformation Methods

The first group of methods transforms the multi-label classification task into one or more single-label classification (BR and LP) or label ranking (LR) tasks.

3.1.1 Binary Relevance (BR)

One of the most straightforward approaches, and one that is algorithm independent, is to divide the problem into q individual binary classification problems (Yang 1999). The decomposition can be achieved using either the one-vs-rest or the one-vs-one approach. This has the advantage of transforming the problem into the well-understood binary classification problem, so any of the standard algorithms can be used. There are three disadvantages. First, a large number of classifiers must be trained if the label set $\mathcal{L}$ is large, and for each classification query all q binary classifiers must be run to obtain a prediction for each class. Second, any correlation between classes is not utilized, as each class is treated independently. Third, the approach suffers from the class imbalance problem, since for each label there is a large number of negative data tuples and only a few positive ones. Researchers have shown that classifiers such as k-nearest neighbor, neural networks and linear least squares fit mapping are viable techniques for this approach (Yang 1999), as are support vector machines. To incorporate class correlation into the Binary Relevance approach, several modified approaches, such as Classifier Chaining (Read, Pfahringer et al. 2009), have been proposed. These algorithms have been shown to perform significantly better than the standard BR approach.

3.1.2 Label Powerset (LP)

The basis of this method is to combine entire label sets into atomic (single) labels to form a single-label problem in which the set of possible single labels represents all distinct label subsets in the original multi-label representation (Read, Pfahringer et al. 2008). A disadvantage of these methods, however, is their worst-case time complexity. The LP-based RAkEL (RAndom k-labELsets) algorithm (Tsoumakas and Vlahavas 2007) attempts to alleviate the computational complexity by considering small random subsets of labels from the powerset and training a classifier for each such subset.

3.1.3 Label Ranking (LR)

Label ranking approaches (Schapire and Singer 2000; Elisseeff and Weston 2001; Crammer and Singer 2002; Brinker, Fürnkranz et al. 2006) learn a multi-label classifier $h$ indirectly, via a scoring function $f : X \times \mathcal{L} \to \mathbb{R}$ that assigns a real number to each instance/label combination. The predicted score $f(x, \lambda)$ usually represents the probability that $\lambda$ is in the label set for $x$. Given these scores, a classifier can be implemented simply by using a threshold:

$h(x) = \{\lambda \in \mathcal{L} \mid f(x, \lambda) \ge t\}$, where $t \in \mathbb{R}$ is a threshold.

The ranking can be exploited in an evaluation metric during the training phase by comparing the predicted ranking of the labels, rather than the predicted label subset, to the actual label subset. One advantage of label ranking approaches is better handling of large numbers of classes, because only a single ranking function is learned. Similar to the binary relevance approaches, however, label ranking approaches are usually unable to exploit class correlation information.
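As a concrete illustration of the problem transformation idea, the following is a minimal sketch (toy, hypothetical values rather than the Amazon data) of the Label Powerset transformation: each distinct label subset becomes one atomic class, so any standard single-label learner can be applied to the transformed data.

```python
# Toy multi-label training set: (feature dict, label set) pairs with made-up values.
toy_train = [
    ({"dept": "A", "title": "engineer"}, {"res_1", "res_7"}),
    ({"dept": "B", "title": "manager"},  {"res_7"}),
    ({"dept": "A", "title": "engineer"}, {"res_1", "res_7"}),
]

# LP transformation: the whole label set is frozen into a single atomic class.
lp_train = [(x, frozenset(L)) for x, L in toy_train]
atomic_classes = {c for _, c in lp_train}
print(len(atomic_classes))  # 2 distinct label subsets -> 2 single-label classes
```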
3.2 Algorithm Adaptation Methods

The second group of methods extends specific learning algorithms to handle multi-label data directly. There exist multi-label extensions of decision tree (Clare and King 2001; Vens, Struyf et al. 2008), support vector machine (Elisseeff and Weston 2001; Boutell, Luo et al. 2004; Godbole and Sarawagi 2004), neural network (Crammer and Singer 2003; Zhang and Zhou 2006), Bayesian (McCallum 1999), lazy learning (Zhu, Ji et al. 2005) and boosting (Schapire and Singer 2000) learning algorithms. Many of these algorithms have been designed to incorporate class correlation into multi-label learning (McCallum 1999; Elisseeff and Weston 2001; Crammer and Singer 2002; Jin and Ghahramani 2002; Ueda and Saito 2002; Boutell, Luo et al. 2004; Gao, Wu et al. 2004; Taskar, Chatalbashev et al. 2004; Tsochantaridis, Hofmann et al. 2004; Ghamrawi and McCallum 2005; Kazawa, Izumitani et al. 2005; Zhu, Ji et al. 2005). However, most of these studies are limited to a relatively small number of classes and assume that the amount of training data is sufficient for training reliable classifiers. In contrast, real-world applications of multi-label learning often feature a large number of classes and a relatively small amount of training data. As a result, the training data related to each class is often sparse and insufficient for learning a reliable classifier.

The consensus view in the literature is that it is crucial to take label correlations into account during the classification process (Godbole and Sarawagi 2004; Tsoumakas and Vlahavas 2007; Yan, Tesic et al. 2007; Ji, Tang et al. 2008; Loza Mencía and Fürnkranz 2008; Read, Pfahringer et al. 2008; Sun, Ji et al. 2008). However, as multi-label datasets grow, most methods struggle with the exponential growth in the number of possible correlations. Consequently, these methods are more accurate on small datasets but are less applicable to larger ones.

3.3 Other Methods

Other methods, which do not create an explicit classification function, are instance-based learning methods (Aha, Kibler et al. 1991). These include the adaptation of the k-nearest neighbor method to multi-label classification (ML-kNN) (Zhang and Zhou 2007) and its extension using logistic regression (Cheng and Hüllermeier 2009).

4. Proposed Method, Experiments, Validation

The following approaches/algorithms have been selected primarily because of their ability to handle large datasets while maintaining sufficiently good predictive accuracy.

4.1 Binary Relevance Approach

The main idea behind this approach is to decompose the problem into binary classification problems. Because the number of labels in the label set is very large (approximately 15,000), the one-vs-one approach is not feasible. Therefore, the training data set is transformed using the one-vs-rest approach. The main steps of the BR approach are outlined below:

for i ← 1 ... |ℒ|:
    D' ← {}
    for each (x, L) ∈ D:
        if λ_i ∈ L:
            D' ← D' ∪ {(x, 1)}
        else:
            D' ← D' ∪ {(x, 0)}
    train classifier h_i : D' → λ_i

The following classification algorithms will be used to train each binary classifier, to set the baseline performance for classification.
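The following is a minimal Python sketch of the one-vs-rest transformation outlined above. It is not the report's implementation: BaseLearner stands in for any binary classifier exposing scikit-learn-style fit/predict methods (e.g. Naïve Bayes or kNN), and feature vectors are assumed to be already encoded as numeric lists.

```python
# A sketch of the one-vs-rest Binary Relevance (BR) transformation.
def train_br(D, label_space, BaseLearner):
    """D: list of (feature_vector, label_set) tuples; returns one classifier per label."""
    classifiers = {}
    for lam in label_space:
        X = [x for x, _ in D]
        y = [1 if lam in L else 0 for _, L in D]     # relabel: this label vs. rest
        classifiers[lam] = BaseLearner().fit(X, y)
    return classifiers

def predict_br(classifiers, x):
    # The predicted label set is the union of all positive binary decisions.
    return {lam for lam, clf in classifiers.items() if clf.predict([x])[0] == 1}
```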
4.1.1 Naïve Bayes (BR Naïve Bayes)

Consider an instance of the feature vector $x_i \in X$ with associated label set $L_i \subseteq \mathcal{L}$, where each instance $x_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\}$ and $d$ is the number of attributes. By Bayes' rule,

$P(\lambda \in L_i \mid x_{i1}, x_{i2}, \ldots, x_{id}) = \dfrac{P(\lambda \in L_i) \cdot P(x_{i1}, x_{i2}, \ldots, x_{id} \mid \lambda \in L_i)}{P(x_{i1}, x_{i2}, \ldots, x_{id})}$    (1)

Assuming that the attributes are conditionally independent given the label, so that $P(x_{ij} \mid \lambda, x_{i1}, \ldots, x_{i,j-1}, x_{i,j+1}, \ldots, x_{id}) = P(x_{ij} \mid \lambda)$, and noting that $P(x_{i1}, x_{i2}, \ldots, x_{id})$ is constant for a given input, eqn (1) can be simplified to

$P(\lambda \mid x_i) = P(\lambda \in L_i \mid x_{i1}, x_{i2}, \ldots, x_{id}) \propto P(\lambda \in L_i) \cdot \prod_{j=1}^{d} P(x_{ij} \mid \lambda \in L_i)$

Since the attributes are all categorical and each attribute space is very large (~10,000 distinct values), the conditional probabilities of the attribute values are estimated using the m-estimate approach:

$P(x_{ij} \mid \lambda \in L_i) = \dfrac{n_c + m p}{n + m}$

where $n$ is the number of instances with class label $\lambda$, $n_c$ is the number of instances with class label $\lambda$ and attribute value $x_{ij}$, $p$ is an estimate of the prior probability of attribute value $x_{ij}$, and $m$ is the equivalent sample size. Based on the posterior probabilities of the labels, the classification function can be defined as

$h_{\lambda}(x_i) = 1$ if $P(\lambda \in L_i \mid x_i) \ge P(\lambda \notin L_i \mid x_i)$, and $h_{\lambda}(x_i) = 0$ otherwise.

4.1.2 k-Nearest Neighbor (BRkNN)

We propose to use the BRkNN algorithm (Spyromitros, Tsoumakas et al. 2008), a faster adaptation of the standard kNN algorithm to the Binary Relevance (BR) setting. The best value of k is selected by hold-one-out cross validation, minimizing the Hamming Loss.

4.1.3 ML-kNN

ML-kNN (Zhang and Zhou 2007) is a binary relevance (BR) learner; however, instead of using the standard voting scheme of the kNN classifier, it applies Naïve Bayes among the k nearest neighbors. Given a query instance $x$, it finds the k nearest neighbors of $x$ in the training data and counts the number of occurrences of $\lambda$ among these neighbors. Based on this count we can define the event $E_j^{\lambda}$ that, among the k nearest neighbors of $x$, exactly $j$ instances have label $\lambda$. Using Bayes' rule, the posterior probability of $\lambda \in L$ is given by

$P(\lambda \in L \mid E_j^{\lambda}) = \dfrac{P(E_j^{\lambda} \mid \lambda \in L) \cdot P(\lambda \in L)}{P(E_j^{\lambda})}$

which leads to the classification rule

$h_{\lambda}(x) = 1$ if $P(E_j^{\lambda} \mid \lambda \in L) \cdot P(\lambda \in L) \ge P(E_j^{\lambda} \mid \lambda \notin L) \cdot P(\lambda \notin L)$, and $h_{\lambda}(x) = 0$ otherwise.

The prior and conditional probabilities are estimated using relative frequencies from the training data. The method may be computationally intensive, as a k-neighborhood needs to be evaluated for each training instance.

4.1.4 Classifier Chains

The Classifier Chain model (Read, Pfahringer et al. 2009) learns $|\mathcal{L}|$ binary classifiers $(h_1, \ldots, h_{|\mathcal{L}|})$, as a BR learner does, but the classifiers are learned sequentially, so that at each step the feature space is augmented with the previously learned labels. This allows the correlation between class labels to be accounted for, although in a limited sense.

Classifier Chains training algorithm:

for i ← 1 ... |ℒ|:
    D' ← {}
    for each (x, L) ∈ D:
        D' ← D' ∪ {(x, l_1, ..., l_{i-1})}
    train classifier h_i : D' → λ_i

Classifier Chains classification algorithm:

L ← {}
for i ← 1 ... |ℒ|:
    L ← L ∪ (l_i ← h_i(x, l_1, ..., l_{i-1}))

The time complexity of the chaining-based classifier is the same as that of the base BR training algorithm.
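A minimal sketch of the Classifier Chain training and prediction steps described above; it is not the report's implementation. Feature vectors are assumed to be plain numeric lists, and BaseLearner is the same hypothetical fit/predict binary learner used in the BR sketch.

```python
# Sketch of a Classifier Chain: each binary classifier sees the original features
# plus the labels that come earlier in the chain.
def train_chain(D, label_order, BaseLearner):
    chain = []
    for i, lam in enumerate(label_order):
        X, y = [], []
        for x, L in D:
            prev = [1 if label_order[j] in L else 0 for j in range(i)]
            X.append(list(x) + prev)                 # augment features with earlier labels
            y.append(1 if lam in L else 0)
        chain.append(BaseLearner().fit(X, y))
    return chain

def predict_chain(chain, label_order, x):
    preds, labels = [], set()
    for lam, clf in zip(label_order, chain):
        p = int(clf.predict([list(x) + preds])[0])   # earlier predictions feed forward
        preds.append(p)
        if p:
            labels.add(lam)
    return labels
```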
4.2 Evaluation Metric

Several criteria exist to evaluate the performance of multi-label classification methods, but most of them evaluate rank-based classification (Tsoumakas, Katakis et al. 2010). For a classifier $h$, let $h(x) \subseteq \mathcal{L}$ denote its multi-label prediction for an instance $x$, and let $L$ denote the actual label set for training instance $x$. We propose to use the Hamming Loss as the evaluation metric, defined as

$\mathrm{HamLoss}(h) = \dfrac{1}{|\mathcal{L}|} \, |h(x) \,\Delta\, L|$

where $\Delta$ denotes the symmetric difference between two sets. The Hamming Loss measures the fraction of labels whose relevance is predicted incorrectly. Since the set of labels for each instance will be represented as a bit string, the Hamming distance can be computed easily by counting the number of mismatched bits.

5. Data Analysis and Preprocessing

For the given problem the feature vector $x_i$ consists of 8 attributes:

{MGR_ID, ROLE_ROLLUP_1, ROLE_ROLLUP_2, ROLE_DEPTNAME, ROLE_TITLE, ROLE_FAMILY_DESC, ROLE_FAMILY, ROLE_CODE}

All the attributes in the feature vector represent categorical data. The data has been re-labeled using randomly selected, unique integers for the purpose of privacy preservation.

Data Filtering

- Invalid/Missing Data: On analysis, the data did not contain any invalid or missing values. However, several inconsistencies were present; for example, at any given time an employee cannot have more than one manager. Tuples containing such inconsistent values were discarded from further consideration.

- Redundant Data: On analysis of the data, it was found that the same data tuples were repeated several times, probably as an artifact of maintaining all snapshots of the employee profile in the relational database. All such redundant tuples were removed from the data set. This resulted in a data set of only 460,452 tuples, a decrease of almost a factor of 20.
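A minimal sketch of the redundant-tuple removal, assuming the raw tuples have been loaded into a pandas DataFrame (the file name is hypothetical); exact repeated rows collapse to a single occurrence.

```python
import pandas as pd

# Hypothetical file name; each row is one employee-profile/resource tuple.
data = pd.read_csv("raw_training_tuples.csv")
deduped = data.drop_duplicates()          # keep one copy of each identical row
print(len(data), "->", len(deduped))      # the report observes 8,116,042 -> 460,452
```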
Data Consolidation

The data supplied by Amazon consists of 3 separate tables: (i) Employee Profile History (feature vector), (ii) Resource Access History (label set), and (iii) Employee Resource Permission Snapshots (additional label information). The tables have evidently been derived from a relational database. To create an input data set for the classification algorithms, all the tables need to be consolidated into a single table in which each row comprises a feature vector and its associated label set.

Consolidation of the dataset presents several challenges, mainly because the Employee Profile History and the Resource Access History contain different timestamps. To accomplish this, the time interval of each employee profile was determined and then mapped to the resource access history. As an additional complexity, it was found that on multiple occasions a resource access interval overlapped two different employee profiles, i.e. the resource access started in one profile and ended in another. This creates an ambiguity as to which profile the resource really belongs to; it is also possible that the resource belongs to both profiles. In the present case we assume that the resource definitely belongs to the first profile, where it was originally accessed. For the second profile we assign the resource on a conditional basis, so that it is used with a reduced weight or added only in cases with borderline predictions. The third table provides resource access snapshots at the start and end of the whole data time period, which allows a more complete resource access set to be formulated for each profile.

Dimensionality Reduction

Based on their definitions and titles, it appears that the attributes are not completely independent; for example, ROLE_FAMILY_DESC seems to be directly correlated with ROLE_FAMILY and ROLE_CODE. Removal of any redundant attributes should improve the accuracy of the Naïve Bayes prediction, as it is based on the conditional independence of the attributes.

Since the attributes represent categorical data, PCA cannot be applied directly. Instead, Multiple Correspondence Analysis (MCA) or nonlinear PCA (NL-PCA) needs to be applied. MCA is not feasible on the current data set, as it involves building an indicator matrix with all instances as rows and all categories as columns, with {0, 1} entries. The number of categories associated with each attribute is extremely high (~10,000); as a result, the memory requirements of the indicator matrix exceed current hardware limits and all available library functions in the R package fail to run. To accomplish this task, an incremental version of the MCA algorithm would need to be implemented from scratch. The other option is to use NL-PCA, which is currently available only in two commercial statistical packages, SAS and SPSS. The possibility of acquiring these tools is currently being explored.
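A back-of-the-envelope sketch, using the sizes stated above (460,452 instances, 8 categorical attributes, on the order of 10,000 categories each), of why the dense MCA indicator matrix is infeasible on ordinary hardware.

```python
n_instances = 460_452
n_attributes = 8
categories_per_attribute = 10_000          # rough order of magnitude from the report

n_columns = n_attributes * categories_per_attribute
cells = n_instances * n_columns
dense_gib = cells * 8 / 2**30              # stored as a dense matrix of 8-byte doubles

print(f"{n_columns:,} indicator columns, {cells:,} cells, ~{dense_gib:.0f} GiB dense")
# Roughly 80,000 columns and 3.7e10 cells, i.e. on the order of 270 GiB as a dense
# double matrix, which is why the standard R implementations fail to run.
```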
"An empirical study of lazy multilabel classification algorithms." Artificial Intelligence: Theories, Models and Applications: 401-406. Sun, L., S. Ji, et al. (2008). Hypergraph spectral learning for multi-label classification, ACM. Tan, P.-N., M. Steinbach, et al. (2006). Introduction to data mining. Boston, Pearson Addison Wesley. Taskar, B., V. Chatalbashev, et al. (2004). Learning associative Markov networks, ACM. Tsochantaridis, I., T. Hofmann, et al. (2004). Support vector machine learning for interdependent and structured output spaces, ACM. Tsoumakas, G. and I. Katakis (2007). "Multi-label classification: An overview." International Journal of Data Warehousing and Mining (IJDWM) 3(3): 1-13. Tsoumakas, G., I. Katakis, et al. (2010). "Mining multi-label data." Data Mining and Knowledge Discovery Handbook: 667-685. Tsoumakas, G. and I. Vlahavas (2007). "Random k-labelsets: An ensemble method for multilabel classification." Machine Learning: ECML 2007: 406-417. Ueda, N. and K. Saito (2002). "Parametric mixture models for multi-labeled text." Advances in neural information processing systems 15: 721-728. Vens, C., J. Struyf, et al. (2008). "Decision trees for hierarchical multi-label classification." Machine Learning 73(2): 185-214. Weifeng Liu, V. D. C., Catherine Huang, Kenneth E. Hild, Ken Montanez. (2012). "The Eighth Annual MLSP Competition." from https://sites.google.com/site/amazonaccessdatacompetition/. Yan, R., J. Tesic, et al. (2007). Model-shared subspace boosting for multi-label classification, ACM. Yang, Y. (1999). "An evaluation of statistical approaches to text categorization." Information retrieval 1(1): 69-90. Zhang, M. L. and Z. H. Zhou (2006). "Multilabel neural networks with applications to functional genomics and text categorization." Knowledge and Data Engineering, IEEE Transactions on 18(10): 1338-1351. Zhang, M. L. and Z. H. Zhou (2007). "ML-KNN: A lazy learning approach to multilabel leaming." Pattern Recognition 40(7): 2038-2048. Zhu, S., X. Ji, et al. (2005). Multi-labelled classification using maximum entropy method, ACM.