CMSC 691 Project Report
Preliminary Report
09-May-12
Purusharth Prakash and Mohid Farazi
1. Introduction
Traditionally, classification has been studied as a binary classification problem in which each data instance (tuple) can carry only one of two class labels. These studies led to the development of several standardized approaches, such as Naïve Bayes, decision trees, neural networks, SVMs, kNN, and ensemble methods such as AdaBoost (Tan, Steinbach et al. 2006).
Since real-world problems can be more complex, extensions have been developed to allow (i) multiclass classification, (ii) collaborative filtering, and (iii) multi-label classification. The Amazon dataset (Weifeng Liu 2012) studied in this project is an example of a multi-label classification problem.
Multi-label classification in relational data is becoming an increasingly relevant problem: most databases are relational, and as their size increases, drawing inferences becomes increasingly difficult. Multi-label classification has a wide variety of real-world applications, e.g. hyperlinked document classification, social network analysis, and collaboration network analysis.
Much of the effort in multi-label classification has been devoted to text classification, which has been extensively studied (Tsoumakas and Katakis 2007). In the remainder of this document, we present the algorithms, techniques, and evaluation of the multi-label classification problem on the Amazon dataset (Weifeng Liu 2012).
2. Problem Description
As part of its MLSP Competition (Weifeng Liu 2012), Amazon has made available historical data recording which types of resources a group of employees accessed over a period of one year. The primary goal of the project is to build a classification model that uses the employee profile characteristics as features to predict the set of resources to which an employee would require access.
The training data, after processing and filtering, has the following form:

Set of class labels: $\mathcal{L} = \{l_1, l_2, \ldots, l_q\}$
Training dataset: $T = \{(x_1, L_1), (x_2, L_2), \ldots, (x_m, L_m)\}$

where $x_i$ is the $i$-th instance of the feature vector, $x_i \in \mathbb{X}$, and $L_i \subseteq \mathcal{L}$; i.e., each training tuple $(x, L) \in \mathbb{X} \times 2^{\mathcal{L}}$.

The goal is to find a classification scheme $h : \mathbb{X} \rightarrow 2^{\mathcal{L}}$ that minimizes some error metric.
In the present study, the Amazon dataset consists of 8,116,042 instances/tuples. The label set consists of 14,999 unique labels.
3. Background and Related Work
Multi-label classification has been studied extensively in the literature, mostly in the context of text classification. One of the first important algorithms to come out of this context is AdaBoost.MH (Schapire and Singer 2000), an extension of the well-known AdaBoost algorithm, which was implemented in the BoosTexter text classification system (Schapire and Singer 2000).
Multi-label classification has received a great deal of attention in machine learning in recent years, and a number of methods have been developed, often motivated by specific types of applications such as text categorization (Schapire and Singer 2000; Ueda and Saito 2002; Kazawa, Izumitani et al. 2005; Zhang and Zhou 2006), computer vision (Boutell, Luo et al. 2004), and bioinformatics (Clare and King 2001; Elisseeff and Weston 2001; Zhang and Zhou 2006). In addition, several well-established methods for conventional classification have been extended to the multi-label case, including support vector machines (Elisseeff and Weston 2001; Boutell, Luo et al. 2004; Godbole and Sarawagi 2004), neural networks (Zhang and Zhou 2006), and decision trees (Vens, Struyf et al. 2008).
Methods developed to address multi-label classification problems can be divided into two groups: (a) problem transformation methods and (b) algorithm adaptation methods.
3.1 Problem Transformation Methods
The first group of methods transforms the multi-label classification task into one or more single-label classification (BR and LP) or label ranking (LR) tasks.
3.1.1 Binary Relevance (BR)
One of the most straightforward approaches, and one that is algorithm independent, is to divide the problem into q individual binary classification problems (Yang 1999). Decomposition can be achieved using either the one-vs-rest or the one-vs-one approach. This has the advantage of transforming the problem into the well-understood binary classification problem, so any of the standard algorithms can be used. There are three disadvantages. First, a large number of classifiers must be trained if the size of the label set $\mathcal{L}$ is large, and for each classification query all q binary classifiers must be run to obtain a prediction for each class. Second, any correlation between classes is not utilized, as each class is treated independently. Third, the approach suffers from the class imbalance problem, since each classifier sees a large number of negative data tuples and only a few positive ones.
Researchers have shown that classifiers such as k-nearest-neighbor, neural networks, and linear least squares fit mapping are viable techniques for this approach (Yang 1999), as are support vector machines.
To incorporate class correlation into the Binary Relevance approach, several modified approaches, such as Classifier Chaining (Read, Pfahringer et al. 2009), have been proposed. These algorithms have been shown to have significantly improved performance compared to the standard BR approach.
3.1.2 Label Powerset (LP)
The basis of this method is to combine entire label sets into atomic (single) labels to form a single-label problem in which the set of possible single labels represents all distinct label subsets in the original multi-label representation (Read, Pfahringer et al. 2008). A disadvantage of these methods, however, is their worst-case time complexity. The LP-based RAkEL (RAndom k-labELsets) algorithm (Tsoumakas and Vlahavas 2007) attempts to alleviate the computational complexity by drawing small random subsets of labels (k-labelsets) and training an LP classifier on each subset.
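As an illustrative sketch (not the RAkEL implementation itself), the basic LP transformation can be written in Python as follows; the function and variable names are our own:

def label_powerset_transform(T):
    # Map each distinct label subset to a single atomic class.
    # T is a list of (feature_vector, label_set) pairs; the result is a
    # single-label dataset plus the subset-to-class mapping.
    subset_to_class = {}
    T_prime = []
    for x, L in T:
        key = frozenset(L)                   # label subsets are unordered
        if key not in subset_to_class:       # new atomic class on first sight
            subset_to_class[key] = len(subset_to_class)
        T_prime.append((x, subset_to_class[key]))
    return T_prime, subset_to_class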
3.1.3 Label Ranking (LR)
Label ranking approaches (Schapire and Singer 2000; Elisseeff and Weston 2001; Crammer and Singer 2002; Brinker, Fürnkranz et al. 2006) learn a multi-label classifier $h$ in an indirect way via a scoring function $f : \mathbb{X} \times \mathcal{L} \rightarrow \mathbb{R}$ that assigns a real number to each instance/label combination. The predicted score $f(x, l)$ usually represents the probability that $l$ is in the label set for $x$. Given these scores, a classifier can be implemented simply by using a threshold:

$h(x) = \{l \in \mathcal{L} \mid f(x, l) \geq t\}$, where $t \in \mathbb{R}$ is a threshold.

The ranking can be exploited in an evaluation metric during the training phase by comparing the predicted ranking of the labels, instead of the predicted label subset, to the actual label subset. One advantage of label ranking approaches is better handling of large numbers of classes, because only a single ranking function is learned. Similar to the binary classification approaches, label ranking approaches are usually unable to exploit class correlation information.
3.2 Algorithm Adaptation Methods
The second group of methods extends specific learning algorithms to handle multi-label data directly. Multi-label extensions exist for decision tree (Clare and King 2001; Vens, Struyf et al. 2008), support vector machine (Elisseeff and Weston 2001; Boutell, Luo et al. 2004; Godbole and Sarawagi 2004), neural network (Crammer and Singer 2003; Zhang and Zhou 2006), Bayesian (McCallum 1999), lazy learning (Zhu, Ji et al. 2005), and boosting (Schapire and Singer 2000) algorithms.
Many of these algorithms have been developed to incorporate class correlation into multi-label learning, including (McCallum 1999; Elisseeff and Weston 2001; Crammer and Singer 2002; Jin and Ghahramani 2002; Ueda and Saito 2002; Boutell, Luo et al. 2004; Gao, Wu et al. 2004; Taskar, Chatalbashev et al. 2004; Tsochantaridis, Hofmann et al. 2004; Ghamrawi and McCallum 2005; Kazawa, Izumitani et al. 2005; Zhu, Ji et al. 2005). However, most of these studies are limited to a relatively small number of classes and assume that the amount of training data is sufficient for training reliable classifiers. In contrast, real-world applications of multi-label learning often feature a large number of classes and a relatively small amount of training data. As a result, the amount of training data related to each class is often sparse and insufficient for learning a reliable classifier.
The consensus view in the literature is that it is crucial to take label correlations into account during the classification process (Godbole and Sarawagi 2004; Tsoumakas and Vlahavas 2007; Yan, Tesic et al. 2007; Ji, Tang et al. 2008; Loza Mencía and Fürnkranz 2008; Read, Pfahringer et al. 2008; Sun, Ji et al. 2008). However, as the size of multi-label datasets grows, most methods struggle with the exponential growth in the number of possible correlations. Consequently, these methods are more accurate on small datasets, but are not as applicable to larger ones.
3.3 Other Methods
Other methods, which do not create an explicit classification function, are instance-based learning methods (Aha, Kibler et al. 1991). These include the adaptation of the k-nearest-neighbor method for multi-label classification (ML-kNN) (Zhang and Zhou 2007) and its extension using logistic regression (Cheng and Hüllermeier 2009).
4. Proposed Method, Experiments, Validation
The following approaches/algorithms have been selected primarily because of their
capability of handling large datasets while maintaining sufficiently good predictive
accuracy.
4.1 Binary Relevance Approach
The main idea behind this approach is to decompose the problem into binary classification problems. Because the number of labels in the label set is very large (approximately 15,000), the one-vs-one approach is not feasible. Therefore, the training dataset is transformed using the one-vs-rest approach.
The main steps of the BR approach are outlined below:

For j ← 1 … |L|:
    T′ ← { }
    for each (x, L) ∈ T:
        if l_j ∈ L: T′ ← T′ ∪ {(x, 1)}
        else: T′ ← T′ ∪ {(x, 0)}
    Train classifier h_j : T′ → l_j
The following classification algorithms will be used to train each classifier, to set the baseline performance for classification.
4.1.1 Naïve Bayes (BR Naïve Bayes)
For a given instance of the feature vector $x_i \in \mathbb{X}$, with associated label set $L_i \subseteq \mathcal{L}$, each instance is $x_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\}$, where $d$ is the number of attributes. By Bayes' theorem,

$P(l \in L_i \mid x_{i1}, \ldots, x_{id}) = \dfrac{P(l \in L_i) \cdot P(x_{i1}, \ldots, x_{id} \mid l \in L_i)}{P(x_{i1}, \ldots, x_{id})}$   (1)

Assuming that the attributes are conditionally independent,

$P(x_{ik} \mid l, x_{i1}, \ldots, x_{ik-1}, x_{ik+1}, \ldots, x_{id}) = P(x_{ik} \mid l)$

and noting that $P(x_{i1}, \ldots, x_{id})$ is constant for a given input, eq. (1) can be simplified to

$P(l \mid x_i) = P(l \in L_i \mid x_{i1}, \ldots, x_{id}) \propto P(l \in L_i) \cdot \prod_{k=1}^{d} P(x_{ik} \mid l \in L_i)$

Since the attributes are all categorical and each attribute space is very large (~10,000 values), the conditional probabilities for the attribute values are estimated using the m-estimate approach:

$P(x_{ik} \mid l \in L_i) = \dfrac{n_{lk} + m p}{n_l + m}$

where $n_l$ = number of instances with class label $l$, $n_{lk}$ = number of instances with class label $l$ and attribute value $x_{ik}$, $p$ = estimate of the prior probability of attribute value $x_{ik}$, and $m$ = equivalent sample size. Based on the posterior probabilities of the labels, the classification function can be defined as

$h_l(x_i) = 1$ if $P(l \in L_i \mid x_i) \geq P(l \notin L_i \mid x_i)$, and $0$ otherwise.
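A sketch of how the m-estimate counts could be accumulated for one such binary classifier follows; the uniform attribute prior p = 1/|values_k| and the default m = 1 are illustrative assumptions, not part of the proposed method:

from collections import defaultdict

def train_nb_for_label(T_prime, d, m=1.0):
    # T_prime: list of (x, y), x a tuple of d categorical attributes,
    # y in {0, 1} indicating whether the instance carries the label.
    n = {0: 0, 1: 0}                                             # class counts n_l
    n_k = [defaultdict(lambda: {0: 0, 1: 0}) for _ in range(d)]  # counts n_lk
    values = [set() for _ in range(d)]                           # observed values
    for x, y in T_prime:
        n[y] += 1
        for k in range(d):
            n_k[k][x[k]][y] += 1
            values[k].add(x[k])

    def score(x, y):
        # Unnormalized P(y) * prod_k P(x_k | y) with m-estimate smoothing,
        # assuming a uniform prior p = 1 / |values_k| per attribute.
        s = n[y] / (n[0] + n[1])
        for k in range(d):
            p = 1.0 / max(len(values[k]), 1)
            s *= (n_k[k][x[k]][y] + m * p) / (n[y] + m)
        return s

    return lambda x: 1 if score(x, 1) >= score(x, 0) else 0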
4.1.2 k-Nearest Neighbor (BRkNN)
We propose to use the BRkNN algorithm (Spyromitros, Tsoumakas et al. 2008), a faster adaptation of the standard kNN algorithm to the Binary Relevance (BR) problem. The best value of k is selected by leave-one-out cross-validation, minimizing the Hamming Loss.
4.1.3 ML-kNN
ML-kNN (Zhang and Zhou 2007) is a binary relevance (BR) learner. However, instead of using the standard voting scheme of the kNN classifier, it applies Naïve Bayes reasoning over the k nearest neighbors. Given a query instance $x$, it finds the k nearest neighbors of $x$ in the training data and counts the number of occurrences of $l$ among these neighbors. Based on this count we can define the event $E_j^l$: among the k nearest neighbors of $x$, exactly $j$ instances have label $l$. Using Bayes' rule, the posterior probability of $l \in L$ is given by

$P(l \in L \mid E_j^l) = \dfrac{P(E_j^l \mid l \in L) \cdot P(l \in L)}{P(E_j^l)}$

which leads to the classification rule

$h_l(x) = 1$ if $P(E_j^l \mid l \in L) \cdot P(l \in L) \geq P(E_j^l \mid l \notin L) \cdot P(l \notin L)$, and $0$ otherwise.

The prior probabilities and the conditional probabilities are estimated using relative frequencies from the training data. The method may be computationally intensive, as the k-neighborhood needs to be evaluated for each training instance.
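A condensed sketch of the resulting decision rule for one label is shown below; it assumes the neighbor search and the frequency counts have already been computed, and the smoothing constant s follows the Laplace-style smoothing of the original paper:

def mlknn_decide(j, k, n_with_label, n_total, cond_counts, s=1.0):
    # j           : number of the k nearest neighbors carrying the label
    # cond_counts : cond_counts[y][i] = number of training instances of
    #               class y (1: has label, 0: lacks it) whose k neighbors
    #               contain exactly i label-carrying instances
    p_in = (s + n_with_label) / (2 * s + n_total)   # smoothed prior P(l in L)
    p_out = 1.0 - p_in
    # Smoothed conditionals P(E_j | l in L) and P(E_j | l not in L).
    c_in = (s + cond_counts[1][j]) / (s * (k + 1) + sum(cond_counts[1]))
    c_out = (s + cond_counts[0][j]) / (s * (k + 1) + sum(cond_counts[0]))
    # Classification rule: compare the two unnormalized posteriors.
    return 1 if c_in * p_in >= c_out * p_out else 0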
4.1.4 Classifier Chains
The Classifier Chain model (Read, Pfahringer et al. 2009) learns $|\mathcal{L}|$ binary classifiers $(h_1, \ldots, h_{|\mathcal{L}|})$, as in the BR approach, but the classifiers are learned sequentially, so that at each step the feature space is augmented with the previously learned labels. This allows the correlation between class labels to be accounted for, although in a limited sense.
Classifier Chains Training Algorithm

For j ← 1 … |L|:
    T′ ← { }
    for each (x, L) ∈ T: T′ ← T′ ∪ {((x, l_1, …, l_{j−1}), l_j)}
    Train classifier h_j : T′ → l_j

Classifier Chains Classification Algorithm

L ← { }
For j ← 1 … |L|:
    L ← L ∪ (l_j ← h_j(x, l_1, …, l_{j−1}))
The time complexity of the chaining-based classifier is the same as that of the base BR training algorithm.
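A compact Python sketch of both procedures is given below; train_binary is again any standard binary learner, and a fixed label ordering is assumed:

def train_classifier_chain(T, labels, train_binary):
    order, chain = list(labels), []
    for j, l in enumerate(order):
        prev = order[:j]
        # Augment the feature vector with the binary values of earlier labels.
        T_prime = [(tuple(x) + tuple(1 if p in L else 0 for p in prev),
                    1 if l in L else 0) for x, L in T]
        chain.append(train_binary(T_prime))
    return order, chain

def predict_classifier_chain(order, chain, x):
    preds, L_hat = [], set()
    for l, h in zip(order, chain):
        y = h(tuple(x) + tuple(preds))  # feed earlier predictions forward
        preds.append(y)
        if y == 1:
            L_hat.add(l)
    return L_hat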
4.2 Evaluation Metric
Several criteria exist for evaluating the performance of multi-label classification methods, but most of them evaluate rank-based classification (Tsoumakas, Katakis et al. 2010). For a classifier $h$, let $h(x) \subseteq \mathcal{L}$ denote its multi-label prediction for an instance $x$, and let $L$ denote the actual label set for training instance $x$. We propose to use Hamming Loss as the evaluation metric, defined as follows:

$\mathrm{HamLoss}(h) = \dfrac{1}{|\mathcal{L}|} \, |h(x) \,\Delta\, L|$

where $\Delta$ denotes the symmetric difference between sets. Hamming Loss computes the fraction of labels whose relevance is predicted incorrectly. Since the set of labels for each instance will be represented as a bit-string, the Hamming distance can easily be computed by counting the number of mismatched bits.
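With that bit-string representation, the per-instance Hamming Loss reduces to an XOR and a bit count; a small sketch:

def hamming_loss(pred_bits, true_bits, q):
    # pred_bits / true_bits: integers whose binary representation encodes
    # label membership (bit i set <=> label l_i predicted / present).
    # q = |L|, the total number of labels.
    return bin(pred_bits ^ true_bits).count("1") / q

# Example with q = 5: prediction 10110 vs. truth 10011 -> 2 mismatched bits.
assert hamming_loss(0b10110, 0b10011, 5) == 2 / 5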
5. Data Analysis and Preprocessing
For the given problem, the feature vector $x_i$ consists of 8 attributes:

{MGR_ID, ROLE_ROLLUP_1, ROLE_ROLLUP_2, ROLE_DEPTNAME, ROLE_TITLE, ROLE_FAMILY_DESC, ROLE_FAMILY, ROLE_CODE}
All the attributes in the feature vector represent categorical data. For the purpose of privacy preservation, the data has been re-labeled using randomly selected, unique integers.
Data Filtering
- Invalid/Missing Data
  On analysis, the data did not contain any invalid or missing values. However, several inconsistencies were present; for example, at any given time an employee cannot have more than one manager. Tuples containing such inconsistent values were discarded from further consideration.
- Redundant Data
  On analysis, it was found that the same data tuples were repeated several times, probably as an artifact of maintaining all snapshots of the employee profile in the relational database. All such redundant tuples were removed from the dataset. This resulted in a dataset of only 460,452 tuples, a decrease of almost a factor of 20.
Data Consolidation
The data supplied by Amazon consists of 3 separate tables:
(i) Employee Profile History (feature vector)
(ii) Resource Access History (label set)
(iii) Employee Resource Permission Snapshots (additional label information)

It is evident that the data tables have been derived from a relational database. To create an input dataset for the classification algorithms, all the tables need to be consolidated into a single table in which each row comprises a feature vector and the associated label set. Consolidation presents several challenges, mainly because the Employee Profile History and the Resource Access History carry different timestamps. To accomplish it, the time interval for each employee profile was determined and then mapped against the resource access history. As an additional complexity, it was found that on multiple occasions a resource access interval overlapped with two different employee profiles, i.e. the resource access started in one profile and ended in another. This creates an ambiguity as to which profile the resource really belongs; it is also possible that the resource belongs to both profiles. In the present case we assume that the resource definitely belongs to the first profile, where it was originally accessed. For the second profile we assign the resource on a conditional basis, such that it is used with a reduced weight or added only in cases with borderline predictions. The third table provides resource access snapshots at the start and end of the whole data time period, which allows a more complete resource access set to be formulated for each profile.
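As an illustration of the mapping step, a sketch using pandas is shown below; the column names are hypothetical, since the actual files use their own schema:

import pandas as pd

def assign_accesses(profiles: pd.DataFrame, accesses: pd.DataFrame) -> pd.DataFrame:
    # Attach each resource access to the employee profile whose time
    # interval contains the access start time (the "first profile" rule).
    merged = accesses.merge(profiles, on="employee_id")
    in_interval = ((merged["access_start"] >= merged["profile_start"]) &
                   (merged["access_start"] < merged["profile_end"]))
    return merged.loc[in_interval, ["employee_id", "profile_id", "resource_id"]]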
Dimensionality Reduction
Based on their definitions and titles, it seems that the attributes are not completely independent; for example, ROLE_FAMILY_DESC seems to be directly correlated with ROLE_FAMILY and ROLE_CODE. Removal of any redundant attributes should improve the accuracy of the Naïve Bayes prediction, as it is based on the conditional independence of the attributes. Since the attributes represent categorical data, PCA cannot be applied directly. Instead, Multiple Correspondence Analysis (MCA) or non-linear PCA (NL-PCA) needs to be applied.
MCA is not feasible on the current dataset, as it involves building an indicator matrix with all instances as rows and all categories as columns, with {0, 1} entries. The number of categories associated with each attribute is extremely high (~10,000), so the memory requirements of the indicator matrix exceed current hardware limits. As a result, all available library functions in the R package fail to run. To accomplish this task, an incremental version of the MCA algorithm would need to be implemented from scratch.
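A back-of-the-envelope check of this claim (the ~10,000 categories per attribute is the estimate quoted above; one byte per entry is deliberately optimistic):

n_rows = 460_452                       # deduplicated instances
n_cols = 8 * 10_000                    # 8 attributes x ~10,000 categories each
dense_gb = n_rows * n_cols * 1 / 1e9   # 1 byte per {0,1} entry
print(f"dense indicator matrix: ~{dense_gb:.0f} GB")  # ~37 GB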
The other option is to use NL-PCA, which is currently available in only two commercial statistical packages, SAS and SPSS. The possibility of acquiring these tools is currently being explored.
6. Results
To be included in the final report
7. Conclusion
To be included in the final report
8. References
Aha, D. W., D. Kibler, et al. (1991). "Instance-based learning algorithms." Machine
Learning 6(1): 37-66.
Boutell, M. R., J. Luo, et al. (2004). "Learning multi-label scene classification." Pattern
Recognition 37(9): 1757-1771.
Brinker, K., J. Fürnkranz, et al. (2006). A unified model for multilabel classification
and ranking, IOS Press.
Cheng, W. and E. Hüllermeier (2009). "Combining instance-based learning and
logistic regression for multilabel classification." Machine Learning 76(2):
211-225.
Clare, A. and R. King (2001). "Knowledge discovery in multi-label phenotype data."
Principles of Data Mining and Knowledge Discovery: 42-53.
Crammer, K. and Y. Singer (2002). A new family of online algorithms for category
ranking, ACM.
Crammer, K. and Y. Singer (2003). "A family of additive online algorithms for
category ranking." The Journal of Machine Learning Research 3: 1025-1058.
Elisseeff, A. and J. Weston (2001). "A kernel method for multi-labelled classification." Advances in neural information processing systems 14: 681-687.
Gao, S., W. Wu, et al. (2004). A MFoM learning approach to robust multiclass multi-label text categorization, ACM.
Ghamrawi, N. and A. McCallum (2005). Collective multi-label classification, ACM.
Godbole, S. and S. Sarawagi (2004). "Discriminative methods for multi-labeled
classification." Advances in Knowledge Discovery and Data Mining: 22-30.
Ji, S., L. Tang, et al. (2008). Extracting shared subspace for multi-label classification,
ACM.
Jin, R. and Z. Ghahramani (2002). "Learning with multiple labels." Advances in
neural information processing systems 15: 897-904.
Kazawa, H., T. Izumitani, et al. (2005). "Maximal margin labeling for multi-topic text categorization." Advances in neural information processing systems 17: 649-656.
Loza Mencía, E. and J. Fürnkranz (2008). "Efficient pairwise multilabel classification
for large-scale problems in the legal domain." Machine Learning and
Knowledge Discovery in Databases: 50-65.
McCallum, A. (1999). Multi-label text classification with a mixture model trained by
EM.
Read, J., B. Pfahringer, et al. (2008). Multi-label classification using ensembles of
pruned sets, IEEE.
Read, J., B. Pfahringer, et al. (2009). "Classifier chains for multi-label classification."
Machine Learning and Knowledge Discovery in Databases: 254-269.
Schapire, R. E. and Y. Singer (2000). "BoosTexter: A boosting-based system for text
categorization." Machine Learning 39(2): 135-168.
Spyromitros, E., G. Tsoumakas, et al. (2008). "An empirical study of lazy multilabel
classification algorithms." Artificial Intelligence: Theories, Models and
Applications: 401-406.
Sun, L., S. Ji, et al. (2008). Hypergraph spectral learning for multi-label classification,
ACM.
Tan, P.-N., M. Steinbach, et al. (2006). Introduction to data mining. Boston, Pearson
Addison Wesley.
Taskar, B., V. Chatalbashev, et al. (2004). Learning associative Markov networks,
ACM.
Tsochantaridis, I., T. Hofmann, et al. (2004). Support vector machine learning for
interdependent and structured output spaces, ACM.
Tsoumakas, G. and I. Katakis (2007). "Multi-label classification: An overview."
International Journal of Data Warehousing and Mining (IJDWM) 3(3): 1-13.
Tsoumakas, G., I. Katakis, et al. (2010). "Mining multi-label data." Data Mining and
Knowledge Discovery Handbook: 667-685.
Tsoumakas, G. and I. Vlahavas (2007). "Random k-labelsets: An ensemble method
for multilabel classification." Machine Learning: ECML 2007: 406-417.
Ueda, N. and K. Saito (2002). "Parametric mixture models for multi-labeled text."
Advances in neural information processing systems 15: 721-728.
Vens, C., J. Struyf, et al. (2008). "Decision trees for hierarchical multi-label
classification." Machine Learning 73(2): 185-214.
Weifeng Liu, V. D. C., Catherine Huang, Kenneth E. Hild, Ken Montanez (2012). "The Eighth Annual MLSP Competition." from https://sites.google.com/site/amazonaccessdatacompetition/.
Yan, R., J. Tesic, et al. (2007). Model-shared subspace boosting for multi-label
classification, ACM.
Yang, Y. (1999). "An evaluation of statistical approaches to text categorization."
Information retrieval 1(1): 69-90.
Zhang, M. L. and Z. H. Zhou (2006). "Multilabel neural networks with applications to functional genomics and text categorization." IEEE Transactions on Knowledge and Data Engineering 18(10): 1338-1351.
Zhang, M. L. and Z. H. Zhou (2007). "ML-kNN: A lazy learning approach to multi-label learning." Pattern Recognition 40(7): 2038-2048.
Zhu, S., X. Ji, et al. (2005). Multi-labelled classification using maximum entropy
method, ACM.