AN ABSTRACT OF THE PROJECT REPORT OF
Qi Lou for the degree of Master of Science in Computer Science presented on
11/26/2013.
Title: Novelty Detection Under The Multi-Instance Multi-Label Framework
Abstract approved:
Raviv Raich
Novelty detection plays an important role in machine learning and signal processing. This
project studies novelty detection in a new setting where the data object is represented as
a bag of instances and associated with multiple class labels, referred to as multi-instance
multi-label (MIML) learning. Contrary to the common assumption in MIML that each
instance in a bag belongs to one of the known classes, in novelty detection, we focus on
the scenario where bags may contain novel-class instances. The goal is to determine,
for any given instance in a new bag, whether it belongs to a known class or a novel
class. Detecting novelty in the MIML setting captures many real-world phenomena and
has many potential applications. For example, in a collection of tagged images, the tags may cover only a subset of the objects present in the images. Discovering an object whose
class has not been previously tagged can be useful for the purpose of soliciting a label
for the new object class. To address this novel problem, we present a discriminative
framework for detecting new class instances. Experiments demonstrate the effectiveness
of our proposed method, and reveal that the presence of unlabeled novel instances in training bags is helpful for detecting such instances at the testing stage.
To the best of our knowledge, novelty detection in the MIML setting has not been
investigated. Our main contributions are: (i) We propose a new problem – novelty
detection in the MIML setting. (ii) We offer a framework based on score functions to
solve the problem. (iii) We illustrate the efficacy of our method on a real-world MIML bioacoustics dataset.
© Copyright by Qi Lou
11/26/2013
All Rights Reserved
Novelty Detection Under The Multi-Instance Multi-Label
Framework
by
Qi Lou
A PROJECT REPORT
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Master of Science
Presented 11/26/2013
Commencement June 2014
Master of Science project report of Qi Lou presented on 11/26/2013.
APPROVED:
Major Professor, representing Computer Science
Director of the School of Electrical Engineering and Computer Science
Dean of the Graduate School
I understand that my project report will become part of the permanent collection of
Oregon State University libraries. My signature below authorizes release of my project
report to any reader upon request.
Qi Lou, Author
ACKNOWLEDGEMENTS
First, I would like to thank my committee members: Prof. Raviv Raich, who has influenced my work the most and offered me an invaluable opportunity to study here for two years; Prof. Xiaoli Fern, who taught me a lot and gave me much help in research; and Prof. Prasad Tadepalli, who is very kind to his students and always willing to help. I would like to thank Prof. David Mellinger, who funded me for a year. I would like to give many thanks to my lab mates, Behrouz, Evgenia, Gaole, Greg, and Zeyu, and also to Forrest and Yuanli in the bioacoustics group. Appreciation also goes to my friend Wei Ma for his kind reminders. I also want to thank many other friends whose names cannot be listed here due to space limits. Without these people I could not have gone so far.
TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Background
  1.2 Bioacoustics – A Motivating Application
  1.3 Outline

2 LITERATURE REVIEW
  2.1 Novelty Detection
  2.2 Multi-Instance Multi-Label Learning

3 PROBLEM FORMULATION
  3.1 A Toy Example

4 KERNEL METHODS
  4.1 Kernel Based Scoring Functions
  4.2 Parameter Tuning

5 EXPERIMENTAL RESULTS
  5.1 MNIST Handwritten Digits Dataset
  5.2 HJA Birdsong Dataset
  5.3 Comparison with One-Class SVM

6 CONCLUSIONS
  6.1 Summary
  6.2 Contributions
  6.3 Publications
  6.4 Future Work

Bibliography

Appendices
  A One-class SVM
  B ROC and AUC

LIST OF FIGURES

1.1 A spectrogram generated from a birdsong recording. (a) shows a spectrogram that contains bird syllables. (b) marks those syllables with yellow brushstrokes.
5.1 Typical examples of ROCs from the handwritten digit data. Subfigure (a) shows a ROC example from the first setting and subfigure (b) gives an example from the second setting.
5.2 Variation of AUC as the ratio of novel instances changes. The error bars stand for standard deviation.

LIST OF TABLES

3.1 Toy problem with two known classes
3.2 Co-occurrence rates for the toy problem
5.1 Bag examples for the handwritten digits data
5.2 Examples of counts of each digit in 5 bags when each component of β is 0.1
5.3 Average AUCs for handwritten digits data (training and testing bags generated by Algorithm 2, i.e., without bag filtration)
5.4 Average AUCs for handwritten digits data (training bags generated by Algorithm 3, i.e., with bag filtration; testing bags by Algorithm 2)
5.5 Names of bird species and the number of total instances for each species
5.6 Average AUCs for birdsong data
5.7 Average AUCs for the handwritten digits data by applying one-class SVM with Gaussian kernel
5.8 Average AUCs for the birdsong data by applying one-class SVM

LIST OF ALGORITHMS

1 Descent Method
2 Bag generation procedure for handwritten digits data
3 Bag generation procedure with filtration for handwritten digits data
Chapter 1: INTRODUCTION
1.1 Background
Novelty detection is the identification of new or unknown data that a learning system
has not been trained with or was not previously aware of [1]. Novelty detection plays
an important role in machine learning. In certain cases, the training data may contain
either unlabeled or incorrectly labeled examples. Moreover, some problems in machine
learning consider the case that examples may come from a class for which no labels
are provided. Therefore, the capability to detect an example from known or unknown
classes is a crucial factor for a learning system. Novelty detection has a wide range
of applications [1] that include intrusion detection, fraud detection, hand-written digit
recognition and statistical process control.
In the traditional setting, only training examples from a nominal distribution are provided and the goal is to determine for a new example whether it comes from the nominal
distribution or not. In this project, we consider novelty detection in a new setting where
the data follows a multi-instance multi-label (MIML) format. The MIML framework
has been primarily studied for supervised learning [2] and widely used in applications
where data is associated with multiple classes and can be naturally represented as bags
of instances (i.e., collections of parts). For example, a document can be viewed as a
bag of words and associated with multiple tags. Similarly, an image can be represented
as a bag of pixels or patches, and associated with multiple classes corresponding to the
objects that it contains. Formally speaking, the training data in MIML consists of a collection of labeled bags $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$, where $X_i \subset \mathcal{X}$ is a set of instances and $Y_i \subset \mathcal{Y}$ is a set of labels. In traditional MIML applications, the goal is to learn a bag-level classifier $f : 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$ that can reliably predict the label set of a previously unseen bag.
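As a concrete illustration, the following minimal sketch shows one way a labeled MIML bag could be represented in code; the class name, fields, and example tags are hypothetical, not a specification from this report.

```python
# Illustrative sketch of a labeled MIML bag: a set of instance feature
# vectors paired with a bag-level label set. All names are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledBag:
    instances: np.ndarray  # (n_i, d) array: the instances X_i
    labels: set            # bag-level label set Y_i (a subset of Y)

# An image as a bag of 5 patch features, tagged with two object classes:
bag = LabeledBag(instances=np.zeros((5, 20)), labels={"tree", "car"})
```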
It is commonly assumed in MIML that every instance we observe in the training set
belongs to one of the known classes. However, in many applications, this assumption is
violated. For example, in a collection of tagged images, the tag may only cover a subset
of objects present in the images. The goal of novelty detection in the MIML setting is
to determine whether a given instance comes from an unknown class given only a set of
bags labeled with the known classes. This setup has several advantages compared to the
more well-known setup in novelty detection: First, the labeled bags allow us to apply
an approach that takes into account the presence of multiple known classes. Second,
frequently the training set would contain some novel class instances. The presence of
such instances, although never explicitly labeled as novel instances, can in a way serve as
“implicit” negative examples for the known classes, which can be helpful for identifying
novel instances in new bags.
The work presented in this report is inspired by a real world bioacoustics application. In
this application, the annotation of individual bird vocalization is often a time consuming
task. As an alternative, experts identify from a list of focal bird species the ones that
they recognize in a given recording. Such labels are associated with the entire recording
and not with a specific vocalization in the recording. Based on a collection of such
labeled recordings, the goal is to annotate each vocalization in a new recording [3]. An
implicit assumption here is that each vocalization in the recording must come from one of the focal species; the focal list, however, can be incomplete. Under this assumption, vocalizations of
new species outside of the focal list will not be discovered. Instead, such vocalizations
will be annotated with a label from the existing species list. The setup proposed in
this project allows for novel instances to be observed in the training data without being
explicitly labeled, and hence should enable the annotation of vocalizations from novel
species. In turn, such novel instances can be presented back to the experts for further
inspection.
1.2 Bioacoustics – A Motivating Application
Bioacoustics commonly refers to the study of animal sounds. In our research, we
examine a collection of audio recordings of birds. For each recording in the collection,
experts only indicate the presence or absence of species from a given list to avoid the labor-intensive task of labeling each vocalization in the recording. However, we are interested in annotating each vocalization in a newly given recording.
We observe that this problem has a MIML structure. When dealing with recordings,
we usually convert them to spectrograms of short time duration. Figure 1.1 provides an example of a 10-second spectrogram. The yellow blocks are syllables of a birdsong.
By extracting features from each syllable, a spectrogram can be represented as a bag
of feature vectors. Since vocalizations of several focal species may appear in the same
spectrogram, a spectrogram can be associated with multiple labels that indicate the
present bird species. Therefore, a spectrogram is a bag of instances associated with
multiple bag-level labels.
It is possible that some syllables in a spectrogram do not come from the birds in the list.
Our goal in this work is to identify such syllables, which may help to discover new bird
species. Thus, our task is to detect novelty under the MIML framework.
Figure 1.1: A spectrogram generated from a birdsong recording. (a) shows a spectrogram that contains bird syllables. (b) marks those syllables with yellow brushstrokes.
1.3 Outline
In Chapter 2, we present a literature review. We describe previous work on novelty
detection and multi-instance multi-label learning in detail.
Chapter 3 gives a toy example to illustrate the intuition behind novelty detection in the
MIML setting. It also indicates the use of score functions.
Chapter 4 introduces the mathematical formulation for the problem. We provide a training method for instance-level score functions that are based on bag-level data. Those
instance-level score functions are appropriately combined to detect whether a given instance belongs to a known class or not.
Chapter 5 presents all the experimental results. The results are based on both synthetic
data and real-world data that demonstrate the effectiveness of our algorithm. We also
present a baseline approach, one-class SVM, a well-known anomaly detection algorithm, for comparison with ours.
Chapter 6 concludes this report. We summarize the main contents of the project, point
out our major contributions, and sketch possible directions for future work.
Chapter 2: LITERATURE REVIEW
In this chapter, we review work related to novelty detection and multi-instance multi-label learning. Note that in this review of novelty detection, we also include anomaly detection, despite some subtle differences [4].
2.1 Novelty Detection
Novelty detection is important and much work has been done in this field. According
to [1, 5], early work (before 2003) is generally divided into two categories.
One category includes statistical approaches. Those approaches generally try to discover
the statistical properties from training data and then use such information to determine
whether a given example comes from the same distribution or not. [6] calculates the distance between a given example and a class mean and sets a threshold based on the
standard deviation. In [7], the authors use a box plot, which displays the lower extreme, lower quartile, median, upper quartile, and upper extreme, to generate a rejection region
for outliers. Many approaches try to estimate the density function from the known
class directly and then set a threshold to decide whether a given example is generated
from that distribution or not. Parametric approaches usually assume that the data is
generated from a known distribution (e.g., Gaussian) with unknown parameters and then
estimate those parameters from the data [8]. In contrast, non-parametric approaches do not have to make assumptions about the underlying distribution of the data when doing density estimation. Nearest-neighbor-based methods [9] are typical examples.
The other category consists of machine-learning-based approaches, which include multilayer perceptrons, self-organizing maps, and support vector machines. In [10], Bishop
investigates the relationship between the degree of novelty of input data and the corresponding reliability of the outputs from the network. He quantifies the novelty by
modeling the unconditional probability density of the input data used while training.
LeCun et al. [11] apply back-propagation networks to the handwritten digit recognition.
They introduce three conditions to generate a rejection region. In [12], Singh et al.
introduce a rejection filter that classifies only known examples. They determine
novel examples by comparing fuzzy clusters of those examples to examples from known
classes. In [13], SVMs are applied to novelty detection to learn a function f that is
positive on a subset S of the input space and negative outside S.
Several new approaches have appeared in recent years. In [14], geometric entropy minimization is introduced for anomaly detection. An efficient anomaly detection
method using bipartite k-NN graphs is presented in [15]. In [16], an anomaly detection
algorithm is proposed based on score functions. Each point gets scores from its nearest
neighbors. This algorithm can be directly applied to novelty detection.
For an extensive review on novelty detection and anomaly detection, we refer the reader
to [1, 4, 5, 17–19].
2.2 Multi-Instance Multi-Label Learning
Multi-instance multi-label learning (MIML) is a relatively new setting and has been a hot research topic in recent years. Because of its general structure, many real-world problems naturally fit this setting; hence it has wide applications in machine learning, computer vision, and data mining. Zhou et al. [20] apply this setting to scene classification. They revise traditional boosting and SVM algorithms into their MIMLBoost and MIMLSVM, apply these MIML algorithms to scene classification, and demonstrate their advantages by comparison with some multi-instance learning algorithms. Zha et al. [21] propose an integrated MIML approach based on hidden conditional random fields. They apply this framework to image classification and achieve superior performance compared to state-of-the-art algorithms. In [22], Surdeanu et al. propose an MIML method for relation extraction. It jointly models all the instances of a pair of entities in text and all their labels via a graphical model with latent variables.
Chapter 3: PROBLEM FORMULATION
3.1 A Toy Example
Suppose we are given a collection of labeled bags $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$, where the $i$th bag $X_i \subset \mathcal{X}$ is a set of instances from the feature space $\mathcal{X} \subset \mathbb{R}^d$, and $Y_i$ is a subset of the known label set $Y = \bigcup_{i=1}^{N} Y_i$. For any label $y_{im} \in Y_i$, there is at least one instance $x_{in} \in X_i$ belonging to this class. We consider the scenario where an instance in $X_i$ may have no label in $Y_i$ related to it, which extends the traditional MIML learning framework. Our goal is to determine for a given instance $x \in \mathcal{X}$ whether it belongs to a known class in $Y$ or not.
To illustrate the intuition behind our general strategy, consider the toy problem shown in Table 3.1. The known label set is {I, II}. We have four labeled bags available. According to the principle that each instance must belong to one class and each bag-level label must have at least one corresponding instance, we conclude that △ is drawn from class I, ○ belongs to class II, and ♢ does not come from the existing classes. ▽ cannot be fully determined based on the current data.
#    Bags (Xi)          Labels (Yi)
1    {△, △, ▽, ○}       {I, II}
2    {△, △, ♢, ▽}       {I}
3    {♢, ♢, ○}          {II}
4    {△, △, ○}          {I, II}

Table 3.1: Toy problem with two known classes
       △      ▽      ○      ♢
I      1      3/4    1/2    1/4
II     1/2    1/4    1      1/4

Table 3.2: Co-occurrence rates for the toy problem
To express this observation mathematically, we calculate the rate of co-occurrence of an instance and a label. For example, △ appears together with label I in bags 1, 2, 4, and they are both missing in bag 3, so the co-occurrence rate $p(\triangle, \mathrm{I}) = 4/4 = 1$. The label II is attached to bags 1, 3, 4, hence $p(\triangle, \mathrm{II}) = 2/4 = 1/2$. All the other rates can be calculated analogously. Table 3.2 lists all the rates. If we classify an instance based on its maximal co-occurrence rate over all classes and set the threshold to 3/4, we reach a result that generally reflects our previous observation.
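As a sketch of this calculation, the snippet below reproduces the co-occurrence rates of Table 3.2 and the thresholded decision; the ASCII symbol names are stand-ins for the glyphs in Table 3.1.

```python
# A minimal sketch of the co-occurrence-rate computation for the toy
# example; bag contents and labels follow Table 3.1 (symbol names are
# placeholders for the instance classes).
bags = [
    ({"tri", "down", "circ"}, {"I", "II"}),   # bag 1: {triangle x2, down-tri, circle}
    ({"tri", "diam", "down"}, {"I"}),          # bag 2: {triangle x2, diamond, down-tri}
    ({"diam", "circ"}, {"II"}),                # bag 3: {diamond x2, circle}
    ({"tri", "circ"}, {"I", "II"}),            # bag 4: {triangle x2, circle}
]

def cooccurrence_rate(symbol: str, label: str) -> float:
    """Fraction of bags in which the symbol and the label are either
    both present or both absent."""
    agree = sum(
        (symbol in instances) == (label in labels)
        for instances, labels in bags
    )
    return agree / len(bags)

for s in ("tri", "down", "circ", "diam"):
    rates = {c: cooccurrence_rate(s, c) for c in ("I", "II")}
    # An instance type is flagged as novel if its best co-occurrence
    # rate falls below the threshold (3/4 in the text).
    novel = max(rates.values()) < 3 / 4
    print(s, rates, "novel" if novel else "known")
```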
This example inspires us to devise a general strategy for detection. We introduce a set
of score functions, each of which corresponds to one class, i.e., for each label c ∈ Y , we
assign a function fc to class c. Generally, for an instance from a specific known class,
the value of the score function corresponding to this class should be large. If all scores
of an instance are below a prescribed threshold, it would not be considered to belong
to any known class. The decision principle is: if $\max_{c \in \{1, \ldots, |Y|\}} f_c(x) < \varepsilon$, return 'unknown'; otherwise, return 'known'.
There are many possible choices for the set of score functions. Generally, the score
functions are expected to enable us to achieve a high true positive rate at a given false positive (Type I error) rate, which can be measured by the area under the ROC curve (AUC). For the properties of ROC and AUC, please refer to Appendix B.
Chapter 4: KERNEL METHODS
4.1 Kernel Based Scoring Functions
We define the score function for class $c$ as follows:

$$f_c(x) = \sum_{x_l \in \bigcup_i X_i} \alpha_{cl}\, k(x, x_l) = \alpha_c^T k(x), \tag{4.1}$$

where the $X_i$'s are training bags, the $x_l$'s are training instances from the training bags, $k(\cdot, \cdot)$ is the kernel function such that $k(x) = (k(x, x_1), \ldots, k(x, x_L))^T$, and the $\alpha_{cl}$'s are the components of the weight vector $\alpha_c = (\alpha_{c1}, \ldots, \alpha_{cL})^T$.
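As an illustrative sketch, the score in Eq. (4.1) and the decision rule from Chapter 3 can be computed as follows, assuming a Gaussian kernel (as used later in Section 4.2) and trained weights `alpha` supplied from elsewhere.

```python
# A minimal sketch of the kernel score function in Eq. (4.1) plus the
# thresholded decision rule; `alpha` is assumed to come from training.
import numpy as np

def gaussian_kernel(x, X_train, gamma):
    """k(x) = (k(x, x_1), ..., k(x, x_L))^T with k(x, x') = exp(-gamma ||x - x'||^2)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    return np.exp(-gamma * d2)

def scores(x, X_train, alpha, gamma):
    """f_c(x) = alpha_c^T k(x) for every known class c."""
    kx = gaussian_kernel(x, X_train, gamma)   # shape (L,)
    return alpha @ kx                         # shape (|Y|,)

def is_novel(x, X_train, alpha, gamma, eps):
    """Return True if max_c f_c(x) < eps, i.e., x matches no known class."""
    return np.max(scores(x, X_train, alpha, gamma)) < eps
```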
We encourage $f_c$ to take positive values on instances in class $c$ and negative values on instances from other classes. Hence, we define the objective function $\mathrm{OBJ}$ as

$$\mathrm{OBJ} = \frac{\lambda}{2} \sum_{c=1}^{|Y|} \alpha_c^T K \alpha_c + \frac{1}{N|Y|} \sum_{i=1}^{N} \sum_{c=1}^{|Y|} F_c(X_i), \tag{4.2}$$

where

$$F_c(X_i) = \max\Big\{0,\; 1 - y_{ic} \max_{x_{ij} \in X_i} f_c(x_{ij})\Big\}, \qquad y_{ic} \in \{-1, +1\},$$

$\lambda$ is a regularization parameter, $K$ is the kernel matrix with $(i,j)$-th entry $k(x_i, x_j)$ for $x_i, x_j \in \bigcup_k X_k$, and $y_{ic} = +1$ if and only if $Y_i$ contains the label for class $c$.
In fact, we define an objective function for each class separately and sum over all these objective functions to construct $\mathrm{OBJ}$. The first term of $\mathrm{OBJ}$ controls model complexity. $F_c(\cdot)$ in the second term of $\mathrm{OBJ}$ can be viewed as a bag-level hinge loss for class $c$, which is a generalization of the single-instance case. If $c$ is a bag-level label of bag $X_i$, we expect $\max_{x_{ij} \in X_i} f_c(x_{ij})$ to be a high score because at least one instance in $X_i$ is from class $c$. Other loss functions, such as the rank loss [3], have already been introduced for MIML learning.
Our goal is to minimize the objective function, which is unfortunately non-convex. However, if we fix the term $\max_{x_{ij} \in X_i} f_c(x_{ij})$, i.e., find the support instance $x_{ic}$ such that $x_{ic} = \operatorname{argmax}_{x_{ij} \in X_i} \alpha_c^T k(x_{ij})$, and substitute it back into the objective function, the resulting objective function $\mathrm{OBJ}^*$ is convex with respect to the $\alpha_c$'s. To solve this convex problem, we deploy the L-BFGS [23] algorithm. The subgradient along $\alpha_c$ used in L-BFGS is computed as follows:

$$\nabla_c = \lambda K \alpha_c - \frac{1}{N|Y|} \sum_{i=1}^{N} y_{ic}\, k(x_{ic})\, \mathbf{1}_{\{1 - y_{ic} f_c(x_{ic}) > 0\}}. \tag{4.3}$$
Details can be found in Algorithm 1. This descent method can be applied to any choice of kernel function, and in our experience it works very well (it usually converges within 30 steps). Note that many algorithms [3, 24] for MIML learning that attempt to learn instance-level score functions, including the proposed approach, are based on a non-convex objective. Consequently, no global optimum is guaranteed. To reduce the effect of randomness, we usually rerun the algorithm multiple times with independent random initializations and adopt the result with the smallest value of the objective function.
Algorithm 1 Descent Method
Require: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$, $\lambda$, $T$.
  Randomly initialize all $\alpha_c$'s s.t. $\|\alpha_c^1\| = 1$
  for $t = 1$ to $T$ do
    Set $x_{ic}^t = \operatorname{argmax}_{x_{ij} \in X_i} (\alpha_c^t)^T k(x_{ij})$,
        $\mathbf{1}_{ic}^t = \mathbf{1}_{\{1 - y_{ic} f_c(x_{ic}^t) > 0\}}$,
        $\nabla_c^t = \lambda K \alpha_c - \frac{1}{N|Y|} \sum_{i=1}^{N} y_{ic}\, k(x_{ic}^t)\, \mathbf{1}_{ic}^t$.
    Plug $\{x_{ic}^t\}$ into $\mathrm{OBJ}$ to get a convex surrogate $\mathrm{OBJ}^{t*}$.
    Run L-BFGS with inputs $\mathrm{OBJ}^{t*}$, $\nabla_c^t$ to return $\{\alpha_c^{t+1}\}$ and $\mathrm{OBJ}^{t+1}$.
  end for
  return $\{\alpha_c^{T+1}\}$ and $\mathrm{OBJ}^{T+1}$.
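A rough sketch of Algorithm 1 follows, under stated assumptions: the kernel matrix over all training instances, per-bag instance indices, and bag labels in {−1, +1} are precomputed, and SciPy's L-BFGS-B stands in for L-BFGS. This is an illustration of the alternating scheme, not the authors' exact implementation.

```python
# Sketch of Algorithm 1. K: (L, L) kernel matrix over training instances;
# bag_idx: list of N index arrays (instances per bag); Y: (N, C) in {-1,+1}.
import numpy as np
from scipy.optimize import minimize

def descent(K, bag_idx, Y, lam, T=30, seed=0):
    L, (N, C) = K.shape[0], Y.shape
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal((C, L))
    alpha /= np.linalg.norm(alpha, axis=1, keepdims=True)  # ||alpha_c^1|| = 1

    for _ in range(T):
        # Fix support instances x_ic = argmax_{x in X_i} alpha_c^T k(x).
        F = alpha @ K                                      # f_c at each instance
        sup = np.array([[idx[np.argmax(F[c, idx])] for idx in bag_idx]
                        for c in range(C)])                # (C, N) support indices

        def obj_grad(a_flat):
            a = a_flat.reshape(C, L)
            obj, grad = 0.0, np.zeros_like(a)
            for c in range(C):
                obj += 0.5 * lam * a[c] @ K @ a[c]         # regularizer
                grad[c] = lam * K @ a[c]
                for i in range(N):
                    margin = 1 - Y[i, c] * (a[c] @ K[sup[c, i]])
                    if margin > 0:                         # hinge loss active
                        obj += margin / (N * C)
                        grad[c] -= Y[i, c] * K[sup[c, i]] / (N * C)
            return obj, grad.ravel()

        # Minimize the convex surrogate OBJ^{t*} with L-BFGS-B, warm-started.
        res = minimize(obj_grad, alpha.ravel(), jac=True, method="L-BFGS-B")
        alpha = res.x.reshape(C, L)
    return alpha
```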
4.2 Parameter Tuning
In our experiments, we use the Gaussian kernel, i.e., $k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, where $\|\cdot\|$ is the Euclidean norm. The parameter $\gamma$ controls the bandwidth of the kernel. Hence, there is a pair of parameters, $\lambda$ and $\gamma$, in the objective function that needs to be determined. While training, we search over a wide range of values for the parameter pair and select the pair whose corresponding $\alpha_c$'s minimize

$$\sum_{i=1}^{N} \sum_{c=1}^{|Y|} g\Big(y_{ic} \max_{x_{ij} \in X_i} f_c(x_{ij})\Big),$$

where $g(x) = \mathbf{1}_{x<0}$ is the zero-one loss function. Note that $\mathbf{1}_{x<0}$ is a lower bound of the hinge loss $\max\{0, 1-x\}$.

At test time, we vary the value of the threshold to generate ROCs. The threshold values are derived from training examples.
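A hedged sketch of this grid search is shown below; it reuses the `descent` routine sketched after Algorithm 1, and the grid values are illustrative assumptions rather than the values used in the experiments.

```python
# Sketch of the (lambda, gamma) grid search with the zero-one criterion.
import numpy as np

def zero_one_criterion(alpha, K, bag_idx, Y):
    """Sum of 1{y_ic * max_{x in X_i} f_c(x) < 0} over bags and classes."""
    F = alpha @ K
    bag_max = np.array([[F[c, idx].max() for c in range(Y.shape[1])]
                        for idx in bag_idx])               # (N, C)
    return int(np.sum(Y * bag_max < 0))

def tune(X, bag_idx, Y, lams=(1e-3, 1e-2, 1e-1), gammas=(0.01, 0.1, 1.0)):
    best = None
    for gamma in gammas:
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-gamma * d2)                            # Gaussian kernel matrix
        for lam in lams:
            alpha = descent(K, bag_idx, Y, lam)
            err = zero_one_criterion(alpha, K, bag_idx, Y)
            if best is None or err < best[0]:
                best = (err, lam, gamma, alpha)
    return best
```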
Chapter 5: EXPERIMENTAL RESULTS
In this chapter, we provide a number of experimental results based on both synthetic
data and real-world data to show the effectiveness of our algorithm. Additionally, we
present a comparison to one-class SVM, a notable anomaly detection algorithm.
5.1 MNIST Handwritten Digits Dataset
We generated the synthetic data based on the MNIST handwritten digits data set (http://www.cs.nyu.edu/~roweis/data.html). Each
image in the data set is a 28 by 28 bitmap, i.e., a vector of 784 dimensions. By using
PCA, we reduced the dimension of instances to 20.
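As a small sketch of this preprocessing step (scikit-learn's PCA is an assumption; the report does not name a library):

```python
# Flatten each 28x28 bitmap to a 784-dimensional vector and project
# to 20 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

def reduce_digits(images: np.ndarray) -> np.ndarray:
    """images: (n, 28, 28) array -> (n, 20) PCA features."""
    flat = images.reshape(len(images), -1)   # (n, 784)
    return PCA(n_components=20).fit_transform(flat)
```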
number    1           2            3            4            5
bags      [bag images not reproduced]
labels    ‘1’         ‘0’, ‘1’     ‘2’, ‘3’     ‘0’, ‘1’     ‘0’, ‘2’

Table 5.1: Bag examples for the handwritten digits data. We take the first four digits ‘0’, ‘1’, ‘2’, ‘3’ as known classes, i.e., Y = {‘0’, ‘1’, ‘2’, ‘3’}. In each bag, some instances are without associated labels. For example, in bag 1, examples of ‘5’ and ‘9’ are considered to be from unknown classes.
We created training and testing bags from the MNIST instances. Some examples of handwritten digit bags are shown in Table 5.1. Two processes for generating bags are
listed in Algorithm 2 and Algorithm 3. The only difference between these two procedures
is that Algorithm 3 rules out the possibility of a label set for a bag being empty, i.e., a
bag including purely novel examples. For the Dirichlet distribution used in our simulation, we assigned relatively small concentration parameters $\beta = (\beta_1, \beta_2, \ldots, \beta_{10})$ in order to encourage a sparse label set for each bag, which is common in real-world scenarios. We set all $\beta_i = 0.1$ and the bag size $M = 20$. Typical examples of bags generated from the Dirichlet distribution are shown in Table 5.2.
        ‘0’  ‘1’  ‘2’  ‘3’  ‘4’  ‘5’  ‘6’  ‘7’  ‘8’  ‘9’
Bag 1    0    1    0    0    5    5    2    3    4    0
Bag 2    6    0    4    8    0    0    1    0    0    1
Bag 3    0   17    0    2    0    0    1    0    0    0
Bag 4    0    0    1    0    2   16    1    0    0    0
Bag 5    0    0   15    0    0    0    0    5    0    0

Table 5.2: Examples of counts of each digit in 5 bags when each component of β is 0.1. The bag size is set to 20.
Algorithm 2 Bag generation procedure for handwritten digits data
Require: $N$, $M$, $Y$, $\beta$.
  for $i = 1$ to $N$ do
    Draw $M$ instances $\{x_{ij}\}$ according to the proportion given by the Dirichlet($\beta$) distribution.
    Extract labels from the $x_{ij}$'s to form $Y_i'$ and set $Y_i = Y \cap Y_i'$.
  end for
We provided our method with bags generated in two different ways:
1. Generate both training and testing bags according to Algorithm 2.
2. Generate training bags according to Algorithm 3 while generating testing bags by applying Algorithm 2.
Algorithm 3 Bag generation procedure with filtration for handwritten digits data
Require: $N$, $M$, $Y$, $\beta$.
  for $i = 1$ to $N$ do
    Set $Y_i = \emptyset$.
    while $Y_i == \emptyset$ do
      Draw $M$ instances $\{x_{ij}\}$ according to the proportion given by the Dirichlet($\beta$) distribution.
      Extract labels from the $x_{ij}$'s to form $Y_i'$ and set $Y_i = Y \cap Y_i'$.
    end while
  end for
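A sketch of both generation procedures follows; the instance pool structure and sampling details are assumptions consistent with the algorithm descriptions (Dirichlet class proportions, bag size M, label set Y ∩ Y_i′), with `filtered=True` reproducing Algorithm 3's redraw loop.

```python
# Sketch of Algorithms 2 and 3. `pool` maps each digit class to an array
# of its available instance vectors (assumed large enough to sample from).
import numpy as np

def make_bag(pool, known, beta, M, rng, filtered=False):
    """Return one (bag, labels) pair; with filtered=True, redraw until the
    bag-level label set is non-empty (Algorithm 3)."""
    classes = sorted(pool)
    while True:
        props = rng.dirichlet(beta)                        # class proportions
        counts = rng.multinomial(M, props)                 # M instances total
        bag, labels = [], set()
        for cls, n in zip(classes, counts):
            picks = rng.choice(len(pool[cls]), size=n, replace=False)
            bag.extend(pool[cls][p] for p in picks)
            if n > 0 and cls in known:
                labels.add(cls)                            # Y_i = Y ∩ Y_i'
        if labels or not filtered:
            return bag, labels
```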
In our experiments, we consider various sizes of known label sets and different combinations of labels in these two settings. Two typical examples of ROCs from the two settings are shown in Figure 5.1.
Table 5.3 shows the average AUCs of ROCs over multiple runs for the first setting. We observe that the average AUCs are all above 0.85 for the known label sets of size 4. For the known label sets of size 8, the average AUCs are all larger than 0.8. The results are fairly stable across different combinations of labels. This demonstrates the effectiveness of our algorithm.
Table 5.4 shows the average AUCs of ROCs for the setting in which the training bags never have an empty label set. The label sets in these two tables are the same. The results in the two tables are comparable, but those in Table 5.3 are always better. This demonstrates that it is beneficial to include bags with an empty label set. The reason could be that those bags contain purely novel examples, and hence training on those bags is very reliable.
To understand to what extent adding novel instances helps, we gradually increase the percentage of novel instances in the training and testing data.
Y                      AUC  |  Y                                     AUC
{‘0’,‘1’,‘3’,‘7’}      0.89 |  {‘0’,‘1’,‘2’,‘3’,‘4’,‘5’,‘6’,‘7’}     0.85
{‘2’,‘4’,‘7’,‘8’}      0.87 |  {‘2’,‘3’,‘4’,‘5’,‘6’,‘7’,‘8’,‘9’}     0.88
{‘2’,‘5’,‘6’,‘7’}      0.91 |  {‘0’,‘1’,‘4’,‘5’,‘6’,‘7’,‘8’,‘9’}     0.84
{‘3’,‘5’,‘7’,‘9’}      0.85 |  {‘0’,‘1’,‘2’,‘3’,‘6’,‘7’,‘8’,‘9’}     0.85
{‘3’,‘6’,‘8’,‘9’}      0.89 |  {‘0’,‘1’,‘2’,‘3’,‘4’,‘5’,‘8’,‘9’}     0.83

Table 5.3: Average AUCs for handwritten digits data. Y is the known label set. Training bags and testing bags are both generated according to Algorithm 2, i.e., without bag filtration.
Y                      AUC  |  Y                                     AUC
{‘0’,‘1’,‘3’,‘7’}      0.86 |  {‘0’,‘1’,‘2’,‘3’,‘4’,‘5’,‘6’,‘7’}     0.85
{‘2’,‘4’,‘7’,‘8’}      0.86 |  {‘2’,‘3’,‘4’,‘5’,‘6’,‘7’,‘8’,‘9’}     0.84
{‘2’,‘5’,‘6’,‘7’}      0.88 |  {‘0’,‘1’,‘4’,‘5’,‘6’,‘7’,‘8’,‘9’}     0.82
{‘3’,‘5’,‘7’,‘9’}      0.83 |  {‘0’,‘1’,‘2’,‘3’,‘6’,‘7’,‘8’,‘9’}     0.84
{‘3’,‘6’,‘8’,‘9’}      0.86 |  {‘0’,‘1’,‘2’,‘3’,‘4’,‘5’,‘8’,‘9’}     0.80

Table 5.4: Average AUCs for handwritten digits data. Y is the known label set. Training bags are generated according to Algorithm 3, i.e., with bag filtration, while testing bags are generated by Algorithm 2, i.e., without bag filtration.
From Figure 5.2, we can observe that as the ratio of novel instances increases from 0.1 to 0.5, the performance first improves until the ratio reaches 0.3 and then levels off. This trend suggests that increasing the number of novel examples toward a balanced dataset improves accuracy. Of course, the improvement is only noticeable over a small range of percentages.
5.2 HJA Birdsong Dataset
We tested our algorithm on a real-world dataset, the HJA birdsong dataset (available online at http://web.engr.oregonstate.edu/~briggsf/kdd2012datasets/hja_birdsong/), which has been used in [25, 26]. This dataset consists of 548 bags, each of which contains several 38-dimensional instances. The bag size, i.e., the number of instances in a bag, varies from 1 to 26, with an average of approximately 9. The dataset includes 4998
instances from 13 species. Species names and the numbers of instances for those species are listed in Table 5.5. Each species corresponds to a class in the complete label set {1, 2, ..., 13}. We took a subset of the complete label set as the known label set and conducted experiments with various choices of the known label set. Table 5.6 shows the average AUCs for different known label sets. Specifically, we intentionally made each species appear at least once in those known sets. From Table 5.6, we observe that almost all of the AUC values are above 0.85 and some even reach 0.9. The results are quite stable across different label settings despite the imbalance in the instance population of the species. These results illustrate the potential of the approach as a utility for novel species discovery.
Class   Species                       No. of Instances
1       Brown Creeper                 602
2       Winter Wren                   810
3       Pacific-slope Flycatcher      501
4       Red-breasted Nuthatch         494
5       Dark-eyed Junco               82
6       Olive-sided Flycatcher        277
7       Hermit Thrush                 32
8       Chestnut-backed Chickadee     345
9       Varied Thrush                 139
10      Hermit Warbler                120
11      Swainson’s Thrush             190
12      Hammond’s Flycatcher          1280
13      Western Tanager               126

Table 5.5: Names of bird species and the number of total instances for each species. Each species corresponds to one class.
Y              AUC  |  Y                         AUC
{1,2,4,8}      0.90 |  {1,2,3,4,5,6,7,8}         0.89
{3,5,7,9}      0.85 |  {3,4,5,6,7,8,9,10}        0.85
{4,6,8,10}     0.88 |  {5,6,7,8,9,10,11,12}      0.89
{5,7,9,11}     0.90 |  {1,7,8,9,10,11,12,13}     0.84
{6,10,12,13}   0.89 |  {1,2,3,9,10,11,12,13}     0.85

Table 5.6: Average AUCs for birdsong data. Y is the known label set.
5.3 Comparison with One-Class SVM
Our algorithm deals with a detection problem in the MIML setting, which is different from the traditional setting for anomaly detection. We argue that traditional anomaly detection algorithms cannot be directly applied to our problem. To make a comparison, we adopt one-class SVM [27–29], a well-known algorithm for anomaly detection. For a detailed introduction to this algorithm, please refer to Appendix A. To apply one-class SVM, we construct normal-class training data consisting of examples from the known label set. The parameter $\nu$ varies from 0 to 1 with step size 0.02 to generate ROCs. The Gaussian kernel is used for one-class SVM. We search for the kernel parameter $\gamma$ over a wide range and select the best one for one-class SVM post hoc. We grant this unfair advantage to one-class SVM for two reasons: (i) it is unclear how to optimize the parameter in the absence of novel instances, and (ii) we would like to illustrate the point that even given such an unfair advantage, one-class SVM cannot outperform our algorithm.

Tables 5.7 and 5.8 show the average AUCs for the handwritten digits data and the birdsong data, respectively. Compared with Tables 5.3, 5.4, and 5.6, the proposed algorithm outperforms one-class SVM in terms of AUC, not only in absolute value but also in stability. This also demonstrates that training with unlabeled instances is beneficial to the detection.
Y                      AUC  |  Y                                     AUC
{‘0’,‘1’,‘3’,‘7’}      0.66 |  {‘0’,‘1’,‘2’,‘3’,‘4’,‘5’,‘6’,‘7’}     0.59
{‘2’,‘4’,‘7’,‘8’}      0.66 |  {‘2’,‘3’,‘4’,‘5’,‘6’,‘7’,‘8’,‘9’}     0.63
{‘2’,‘5’,‘6’,‘7’}      0.57 |  {‘0’,‘1’,‘4’,‘5’,‘6’,‘7’,‘8’,‘9’}     0.68
{‘3’,‘5’,‘7’,‘9’}      0.65 |  {‘0’,‘1’,‘2’,‘3’,‘6’,‘7’,‘8’,‘9’}     0.62
{‘3’,‘6’,‘8’,‘9’}      0.63 |  {‘0’,‘1’,‘2’,‘3’,‘4’,‘5’,‘8’,‘9’}     0.65

Table 5.7: Average AUCs for the handwritten digits data by applying one-class SVM with Gaussian kernel. Y is the known label set.
Y              AUC  |  Y                         AUC
{1,2,4,8}      0.78 |  {1,2,3,4,5,6,7,8}         0.85
{3,5,7,9}      0.79 |  {3,4,5,6,7,8,9,10}        0.82
{4,6,8,10}     0.82 |  {5,6,7,8,9,10,11,12}      0.75
{5,7,9,11}     0.73 |  {1,7,8,9,10,11,12,13}     0.70
{6,10,12,13}   0.78 |  {1,2,3,9,10,11,12,13}     0.60

Table 5.8: Average AUCs for the birdsong data by applying one-class SVM. Y is the known label set.
Figure 5.1: Typical examples of ROCs from the handwritten digit data. Subfigure (a) shows a ROC example from the first setting and subfigure (b) gives an example from the second setting. (Both panels plot True Positive Rate against False Positive Rate.)
Figure 5.2: Variation of AUC as the ratio of novel instances changes (AUC plotted against the novelty ratio). The error bars stand for standard deviation.
Chapter 6: CONCLUSIONS
6.1 Summary
In this report, we proposed a new problem – novelty detection in the MIML setting – and offered a framework based on score functions to solve it. A large number of simulations show that our algorithm works well not only on synthetic data but also on real-world data. A comparison with one-class SVM also shows the superiority of our method. We demonstrate that the presence of unlabeled examples in the training set is useful for detecting new-class examples at test time, and we highlight the advantage of the MIML setting for novelty detection. Even though positive examples of novelty are not directly labeled, their presence provides a clear advantage over methods that rely on data that does not include novel-class examples.
6.2 Contributions
To the best of our knowledge, novelty detection in the MIML setting has not been
investigated yet. Our main contributions are:
• We pose a novel problem – novelty detection in the MIML setting.
• We develop a framework based on score functions and provide a practical algorithm to solve the problem.
• We illustrate the efficacy of our method not only on the synthetic handwritten digit data, but also on a real-world MIML bioacoustics dataset.
6.3 Publications
The following is a list of publications that resulted from the author's MS research.
1. Qi Lou, Raviv Raich, Forrest Briggs, Xiaoli Z. Fern, “Novelty Detection Under the Multi-Label Multi-Instance Framework”, accepted to the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013.
2. Forrest Briggs, Xiaoli Fern, Raviv Raich, Qi Lou, “Instance Annotation for Multi-Instance Multi-Label Learning”, accepted to ACM Transactions on Knowledge Discovery from Data (TKDD), 2013.
6.4 Future Work
There are many related problems that call for further investigation. One is how to incorporate bag-level label information into detection. In our setting, we only make use of bag-level information in training, not in the subsequent detection, since such information may not be available in that phase. However, if bag-level labels are always available, we may use such information in the detection phase to rule out some obvious mistakes by the classifier. This would possibly improve the performance of our algorithm.
Theoretical work, such as a PAC-learnability analysis, might be necessary to guarantee the feasibility of novelty detection in the MIML setting. Currently, we are not sure whether bag-level information is enough for instance-level detection, or under what conditions such detection is possible.
We propose to explore applications of our approach on many other real-world datasets
naturally equipped with the MIML structure. For example, in computer vision, images
are often provided with tags which can indicate the presence of objects in the images.
By representing each image as a bag of instances (e.g., using a patch representation), a collection of images can be cast in the MIML setting. We suggest using our method interactively with active learning approaches to classify whether an instance (e.g., a patch or a region) is of a previously seen type or a new one.
Bibliography
[1] Markos Markou and Sameer Singh, “Novelty detection: A review - part 1: Statistical
approaches,” Signal Processing, vol. 83, pp. 2481–2497, 2003.
[2] Zhi-Hua Zhou, Min-Ling Zhang, Sheng-Jun Huang, and Yu-Feng Li, “Multi-instance
multi-label learning,” Artif. Intell., vol. 176, no. 1, pp. 2291–2320, 2012.
[3] Forrest Briggs, Xiaoli Z. Fern, and Raviv Raich, “Rank-loss support instance machines for MIML instance annotation,” in KDD, 2012.
[4] Varun Chandola, Arindam Banerjee, and Vipin Kumar, “Anomaly detection: A
survey,” ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, July 2009.
[5] Markos Markou and Sameer Singh, “Novelty detection: A review - part 2: Neural
network based approaches,” Signal Processing, vol. 83, pp. 2499–2521, 2003.
[6] G. Manson, G. Pierce, K. Worden, T. Monnier, P. Guy, and K. Atherton, “Long
term stability of normal condition data for novelty detection,” in Proceedings of 7th
International Symposium on Smart Structures and Materials, California, 2000.
[7] Jorma Laurikkala, Martti Juhola, and Erna Kentala, “Informal identification of
outliers in medical data,” 2000.
[8] M.J. Desforges, P.J. Jacob, and J.E. Cooper, “Applications of probability density
estimation to the detection of abnormal conditions in engineering,” in Proceedings
of Institute of Mechanical Engineers, 1998.
[9] Martin E. Hellman, “The nearest neighbor classification rule with a reject option.,”
IEEE Trans. Systems Science and Cybernetics, vol. 6, no. 3, pp. 179–185, 1970.
[10] Chris M. Bishop, “Novelty detection and neural network validation,” 1994.
[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in
Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[12] Sameer Singh and Markos Markou, “An approach to novelty detection applied to
the classification of image regions,” IEEE Trans. on Knowl. and Data Eng., vol.
16, no. 4, pp. 396–407, Apr. 2004.
[13] Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and
John C. Platt, “Support vector method for novelty detection,” in NIPS, 1999, pp.
582–588.
[14] Alfred O. Hero, “Geometric entropy minimization (GEM) for anomaly detection and localization,” in NIPS, 2006, pp. 585–592.
[15] Kumar Sricharan and Alfred O. Hero, “Efficient anomaly detection using bipartite k-NN graphs,” in NIPS, 2011, pp. 478–486.
[16] Manqi Zhao and Venkatesh Saligrama, “Anomaly detection with score functions
based on nearest neighbor graphs,” in NIPS, 2009, pp. 2250–2258.
[17] Victoria Hodge and Jim Austin, “A survey of outlier detection methodologies,”
Artif. Intell. Rev., vol. 22, no. 2, pp. 85–126, Oct. 2004.
[18] Malik Agyemang, Ken Barker, and Rada Alhajj, “A comprehensive survey of numeric and symbolic outlier mining techniques,” Intell. Data Anal., vol. 10, no. 6,
pp. 521–538, Dec. 2006.
[19] Animesh Patcha and Jung-Min Park, “An overview of anomaly detection techniques:
Existing solutions and latest technological trends,” Comput. Netw., vol. 51, no. 12,
pp. 3448–3470, Aug. 2007.
[20] Zhi-Hua Zhou and Min-Ling Zhang, “Multi-instance multi-label learning with application to scene classification,” in Advances in Neural Information Processing Systems 19, 2007.
[21] Zheng-Jun Zha, Xian-Sheng Hua, Tao Mei, Jingdong Wang, Guo-Jun Qi, and Zengfu Wang, “Joint multi-label multi-instance learning for image classification,” in 2008
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA, 2008.
[22] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning,
“Multi-instance multi-label learning for relation extraction,” in Proceedings of the
2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, 2012, EMNLP-CoNLL ’12, pp. 455–465.
[23] Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel, “Representations of
quasi-newton matrices and their use in limited memory methods,” 1994.
[24] Oksana Yakhnenko and Vasant Honavar, “Multi-instance multi-label learning for
image classification with large vocabularies,” in Proceedings of the British Machine
Vision Conference. 2011, pp. 59.1–59.12, BMVA Press.
[25] Forrest Briggs, Xiaoli Z. Fern, Raviv Raich, and Qi Lou, “Instance annotation for
multi-instance multi-label learning,” Transactions on Knowledge Discovery from
Data (TKDD), 2012.
[26] Li-Ping Liu and Thomas G. Dietterich, “A conditional multinomial mixture model
for superset label learning,” in NIPS, 2012, pp. 557–565.
[27] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson, “Estimating the support of a high-dimensional distribution,” 1999.
[28] Chih-Chung Chang and Chih-Jen Lin, “LIBSVM: A library for support vector
machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp.
27:1–27:27, 2011.
[29] Larry M. Manevitz, Malik Yousef, Nello Cristianini, John Shawe-Taylor, and Bob Williamson, “One-class SVMs for document classification,” Journal of Machine Learning Research, vol. 2, pp. 139–154, 2001.
[30] Andrew P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, pp. 1145–1159, 1997.
[31] Thomas A. Lasko, Jui G. Bhagwat, Kelly H. Zou, and Lucila Ohno-Machado, “The
use of receiver operating characteristic curves in biomedical informatics,” J. of
Biomedical Informatics, vol. 38, no. 5, pp. 404–415, Oct. 2005.
[32] M. H. Zweig and G. Campbell, “Receiver-operating characteristic (ROC) plots: a
fundamental evaluation tool in clinical medicine.,” Clinical chemistry, vol. 39, no.
4, pp. 561–577, Apr. 1993.
APPENDICES
Appendix A: One-class SVM
To make this report self-contained, we provide the details of one-class SVM here, which
are mainly based on [27, 29].
We suppose that samples are generated from an underlying distribution in the feature
space. The general idea of one-class SVM is to find a subset $S$ of the feature space such that a new sample lies outside this region with an a priori specified probability $\nu \in (0, 1)$. We
expect S to be as small as possible. A classification function f takes value +1 in S and
−1 outside. It is different from the two-class problem since only one-class examples are
available in the training phase.
Practically, for a given dataset $\{x_1, x_2, \ldots, x_n\}$, the algorithm proceeds as follows:
• First, it maps the data into a feature space via feature map Φ(·).
• Second, it separates the data from the origin using a hyperplane given by $\langle w, \Phi(x) \rangle = \rho$.
• Third, it penalizes outliers by adding the slack variables ξi .
• Last, it trades off between model complexity and training errors via ν.
Mathematically, it suffices to solve the following optimization problem:

$$\min_{w, \xi, \rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{n\nu}\sum_{i=1}^{n} \xi_i - \rho$$

$$\text{s.t.}\quad \langle w, \Phi(x_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, n.$$
This formulation can be kernelized as easily as the regular SVM by examining its dual formulation. One can use LIBSVM [28] to solve this problem, as we did in our implementation. The decision function $f$ is given by $f(x) = \operatorname{sgn}(\langle w, \Phi(x) \rangle - \rho)$.
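For illustration, the following sketch applies a one-class SVM via scikit-learn (whose OneClassSVM wraps LIBSVM); the data and the (ν, γ) values are placeholders, not the settings used in Chapter 5.

```python
# Usage sketch of the one-class SVM baseline on known-class instances.
import numpy as np
from sklearn.svm import OneClassSVM

X_known = np.random.randn(200, 20)            # stand-in for known-class data
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(X_known)

X_test = np.random.randn(10, 20)
pred = clf.predict(X_test)                    # +1 inside the region S, -1 outside
score = clf.decision_function(X_test)         # signed value of <w, Phi(x)> - rho
```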
Appendix B: ROC and AUC
Receiver operating characteristic (ROC) analysis is a standard tool that illustrates the performance of a binary classifier as the threshold varies. A ROC curve plots the true positive rate (TPR) versus the false positive rate (FPR) at various threshold settings. It provides a powerful tool
in model selection and decision making, hence it is widely used in many fields such as
machine learning [30], biomedical informatics [31], clinical medicine [32], etc.
We list the basic concepts in ROC analysis here. Note that in a binary classification
problem, the examples are labeled as positive (P) or negative (N). The classifier will also
give a predicted label to an example. Hence,
• TP: true positive. The predicted label is positive and the actual label is also positive.
• TN: true negative. The predicted label is negative and the actual label is also negative.
• FP: false positive. The predicted label is positive but the actual label is negative.
• FN: false negative. The predicted label is negative but the actual label is positive.
The TPR and FPR are calculated by
TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
From the above formulas, we observe that the TPR is the fraction of positive examples correctly predicted as positive, and the FPR is the fraction of negative examples wrongly predicted as positive. For a detector, our goal is to obtain a high TPR at a low FPR. This is directly related to the area under the ROC curve, i.e., its AUC. Generally, the more ideal a ROC curve is, the closer its AUC is to 1, which is the supremum. It is not straightforward to compare two ROC curves directly, but comparing their AUCs is quite simple. Hence, the AUC of a ROC curve is a very useful indicator. Many numerical methods can be used to calculate the AUC. In MATLAB, one can invoke the ‘trapz’ function (http://www.mathworks.com/help/matlab/ref/trapz.html), which applies the trapezoidal method.
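As a small sketch, a ROC curve and its AUC can be computed by sweeping the threshold over detector scores and applying the trapezoidal rule directly (mirroring what MATLAB's trapz would compute); the inputs are illustrative, and ties in scores are handled point-by-point for simplicity.

```python
# Build a ROC curve by sweeping the decision threshold, then integrate.
import numpy as np

def roc_auc(scores, labels):
    """scores: detector outputs; labels: 1 = positive, 0 = negative."""
    order = np.argsort(-scores)                 # sweep threshold high -> low
    labels = labels[order]
    tp = np.cumsum(labels)                      # true positives at each cut
    fp = np.cumsum(1 - labels)                  # false positives at each cut
    tpr = np.concatenate(([0.0], tp / tp[-1]))  # TP / (TP + FN)
    fpr = np.concatenate(([0.0], fp / fp[-1]))  # FP / (FP + TN)
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal rule
    return fpr, tpr, auc
```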