AN ABSTRACT OF THE PROJECT REPORT OF

Qi Lou for the degree of Master of Science in Computer Science presented on 11/26/2013.

Title: Novelty Detection Under The Multi-Instance Multi-Label Framework

Abstract approved: Raviv Raich

Novelty detection plays an important role in machine learning and signal processing. This project studies novelty detection in a new setting where each data object is represented as a bag of instances and associated with multiple class labels, referred to as multi-instance multi-label (MIML) learning. Contrary to the common assumption in MIML that each instance in a bag belongs to one of the known classes, in novelty detection we focus on the scenario where bags may contain novel-class instances. The goal is to determine, for any given instance in a new bag, whether it belongs to a known class or a novel class. Detecting novelty in the MIML setting captures many real-world phenomena and has many potential applications. For example, in a collection of tagged images, the tags may cover only a subset of the objects present in the images. Discovering an object whose class has not been previously tagged can be useful for soliciting a label for the new object class. To address this new problem, we present a discriminative framework for detecting new-class instances. Experiments demonstrate the effectiveness of our proposed method, and reveal that the presence of unlabeled novel instances in training bags helps the detection of such instances at test time. To the best of our knowledge, novelty detection in the MIML setting has not been investigated before. Our main contributions are: (i) we propose a new problem, novelty detection in the MIML setting; (ii) we offer a framework based on score functions to solve the problem; (iii) we illustrate the efficacy of our method on a real-world MIML bioacoustics dataset.

© Copyright by Qi Lou, 11/26/2013. All Rights Reserved.

Novelty Detection Under The Multi-Instance Multi-Label Framework

by Qi Lou

A PROJECT REPORT submitted to Oregon State University in partial fulfillment of the requirements for the degree of Master of Science.

Presented 11/26/2013. Commencement June 2014.

Master of Science project report of Qi Lou presented on 11/26/2013.

APPROVED:

Major Professor, representing Computer Science

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my project report will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my project report to any reader upon request.

Qi Lou, Author

ACKNOWLEDGEMENTS

First, I would like to thank the committee members: Prof. Raviv Raich, who has influenced my work the most and offered me the invaluable opportunity to study here for two years; Prof. Xiaoli Fern, who taught me a great deal and gave me much help in research; and Prof. Prasad Tadepalli, who is very kind to his students and always willing to help. I would like to thank Prof. David Mellinger, who funded me for a year. Many thanks go to my lab mates, Behrouz, Evgenia, Gaole, Greg, and Zeyu, and also to the members of the bioacoustics group, Forrest and Yuanli. Appreciation also goes to my friend Wei Ma for his kind reminders. I also want to thank the many other friends whose names cannot all be listed here. Without these people I would not have been able to come this far.

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Background
  1.2 Bioacoustics – A Motivating Application
  1.3 Outline
2 LITERATURE REVIEW
  2.1 Novelty Detection
  2.2 Multi-Instance Multi-Label Learning
3 PROBLEM FORMULATION
  3.1 A Toy Example
4 KERNEL METHODS
  4.1 Kernel Based Scoring Functions
  4.2 Parameter Tuning
5 EXPERIMENTAL RESULTS
  5.1 MNIST Handwritten Digits Dataset
  5.2 HJA Birdsong Dataset
  5.3 Comparison with One-Class SVM
6 CONCLUSIONS
  6.1 Summary
  6.2 Contributions
  6.3 Publications
  6.4 Future Work
Bibliography
Appendices
  A One-class SVM
  B ROC and AUC

LIST OF FIGURES

1.1 A spectrogram generated from a birdsong recording. (a) shows a spectrogram that contains bird syllables. (b) marks those syllables with yellow brushstrokes.
5.1 Typical examples of ROCs from the handwritten digit data. Subfigure (a) shows a ROC example from the first setting and subfigure (b) gives an example from the second setting.
5.2 Variation of AUC as the ratio of novel instances changes. The error bars stand for standard deviation.

LIST OF TABLES

3.1 Toy problem with two known classes
3.2 Co-occurrence rates for the toy problem
5.1 Bag examples for the handwritten digits data. We take the first four digits '0', '1', '2', '3' as known classes, i.e., Y = {'0', '1', '2', '3'}. In each bag, some instances have no associated labels; for example, in bag 1 the instances of '5' and '9' are considered to come from unknown classes.
5.2 Examples of counts of each digit in 5 bags when each component of β is 0.1. The bag size is set to 20.
5.3 Average AUCs for handwritten digits data. Y is the known label set. Training bags and testing bags are both generated according to Algorithm 2, i.e., without bag filtration.
5.4 Average AUCs for handwritten digits data. Y is the known label set. Training bags are generated according to Algorithm 3, i.e., with bag filtration, while testing bags are generated by Algorithm 2, i.e., without bag filtration.
5.5 Names of bird species and the total number of instances for each species. Each species corresponds to one class.
5.6 Average AUCs for birdsong data. Y is the known label set.
5.7 Average AUCs for the handwritten digits data by applying one-class SVM with Gaussian kernel. Y is the known label set.
5.8 Average AUCs for the birdsong data by applying one-class SVM. Y is the known label set.

LIST OF ALGORITHMS

1 Descent Method
2 Bag generation procedure for handwritten digits data
3 Bag generation procedure with filtration for handwritten digits data

Chapter 1: INTRODUCTION

1.1 Background

Novelty detection is the identification of new or unknown data that a learning system has not been trained with or was not previously aware of [1]. Novelty detection plays an important role in machine learning. In certain cases, the training data may contain either unlabeled or incorrectly labeled examples. Moreover, some problems in machine learning consider the case where examples may come from a class for which no labels are provided. Therefore, the capability to detect whether an example comes from a known or an unknown class is a crucial property of a learning system. Novelty detection has a wide range of applications [1], including intrusion detection, fraud detection, handwritten digit recognition, and statistical process control. In the traditional setting, only training examples from a nominal distribution are provided, and the goal is to determine whether or not a new example comes from the nominal distribution.

In this project, we consider novelty detection in a new setting where the data follows a multi-instance multi-label (MIML) format. The MIML framework has been primarily studied for supervised learning [2] and is widely used in applications where data is associated with multiple classes and can be naturally represented as bags of instances (i.e., collections of parts). For example, a document can be viewed as a bag of words and associated with multiple tags. Similarly, an image can be represented as a bag of pixels or patches and associated with multiple classes corresponding to the objects that it contains. Formally speaking, the training data in MIML consists of a collection of labeled bags $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$, where $X_i \subset \mathcal{X}$ is a set of instances and $Y_i \subset \mathcal{Y}$ is a set of labels. In traditional MIML applications, the goal is to learn a bag-level classifier $f: 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$ that can reliably predict the label set of a previously unseen bag.

It is commonly assumed in MIML that every instance we observe in the training set belongs to one of the known classes. However, in many applications, this assumption is violated. For example, in a collection of tagged images, the tags may cover only a subset of the objects present in the images. The goal of novelty detection in the MIML setting is to determine whether a given instance comes from an unknown class, given only a set of bags labeled with the known classes. This setup has several advantages compared to the more traditional setup in novelty detection. First, the labeled bags allow us to apply an approach that takes into account the presence of multiple known classes. Second, frequently the training set will contain some novel-class instances.
The presence of such instances, although never explicitly labeled as novel, can serve as "implicit" negative examples for the known classes, which can be helpful for identifying novel instances in new bags.

The work presented in this report is inspired by a real-world bioacoustics application. In this application, annotating individual bird vocalizations is often a time-consuming task. As an alternative, experts identify, from a list of focal bird species, the ones that they recognize in a given recording. Such labels are associated with the entire recording and not with a specific vocalization in the recording. Based on a collection of such labeled recordings, the goal is to annotate each vocalization in a new recording [3]. An implicit assumption here is that each vocalization in the recording comes from one of the focal species; however, the focal list can be incomplete. Under this assumption, vocalizations of new species outside of the focal list will not be discovered. Instead, such vocalizations will be annotated with a label from the existing species list. The setup proposed in this project allows novel instances to appear in the training data without being explicitly labeled, and hence should enable the annotation of vocalizations from novel species. In turn, such novel instances can be presented back to the experts for further inspection.

1.2 Bioacoustics – A Motivating Application

Bioacoustics commonly refers to the study of animal sounds. In our research, we examine a collection of audio recordings of birds. For each recording in the collection, experts only indicate the presence or absence of species from a given list, to avoid the labor-intensive task of labeling each vocalization in the recording. However, we are interested in annotating each vocalization in a newly given recording. We observe that this problem has a MIML structure. When dealing with recordings, we usually convert them to spectrograms of short time duration. Figure 1.1 provides an example of a 10-second spectrogram. The yellow blocks are syllables of a birdsong. By extracting features from each syllable, a spectrogram can be represented as a bag of feature vectors. Since vocalizations of several focal species may appear in the same spectrogram, a spectrogram can be associated with multiple labels that indicate the bird species present. Therefore, a spectrogram is a bag of instances associated with multiple bag-level labels. It is possible that some syllables in a spectrogram do not come from the birds in the list. Our goal in this work is to identify such syllables, which may help to discover new bird species. Thus, our task is to detect novelty under the MIML framework.

Figure 1.1: A spectrogram generated from a birdsong recording. (a) shows a spectrogram that contains bird syllables. (b) marks those syllables with yellow brushstrokes.

1.3 Outline

In Chapter 2, we present a literature review, describing previous work on novelty detection and multi-instance multi-label learning in detail. Chapter 3 gives a toy example to illustrate the intuition behind novelty detection in the MIML setting; it also motivates the use of score functions. Chapter 4 introduces the mathematical formulation of the problem. We provide a method for training instance-level score functions from bag-level data. Those instance-level score functions are appropriately combined to detect whether a given instance belongs to a known class or not.
Chapter 5 presents the experimental results, based on both synthetic and real-world data, that demonstrate the effectiveness of our algorithm. We also compare against a baseline approach, one-class SVM, a well-known anomaly detection algorithm. Chapter 6 concludes this report: we summarize the main contents of the project, point out our major contributions, and sketch possible directions for future work.

Chapter 2: LITERATURE REVIEW

In this chapter, we review work related to novelty detection and multi-instance multi-label learning. Note that under this review of novelty detection, we also include anomaly detection despite some subtle differences [4].

2.1 Novelty Detection

Novelty detection is important, and much work has been done in this field. According to [1, 5], early work (before 2003) is generally divided into two categories.

One category includes statistical approaches. Those approaches generally try to discover the statistical properties of the training data and then use this information to determine whether or not a given example comes from the same distribution. [6] calculates the distance between a given example and a class mean and sets a threshold based on the standard deviation. In [7], the authors use a box plot, which displays the lower extreme, lower quartile, median, upper quartile, and upper extreme, to generate a rejection region for outliers. Many approaches try to estimate the density function of the known class directly and then set a threshold to decide whether or not a given example was generated from that distribution. Parametric approaches usually assume that the data is generated from a known distribution (e.g., Gaussian) with unknown parameters and then estimate the parameters from the data [8]. In contrast, non-parametric approaches do not have to make assumptions on the underlying distribution of the data when performing density estimation. Nearest-neighbor-based methods [9] are typical examples.

The other category consists of machine learning based approaches, which include multilayer perceptrons, self-organizing maps, and support vector machines. In [10], Bishop investigates the relationship between the degree of novelty of input data and the corresponding reliability of the outputs from the network. He quantifies novelty by modeling the unconditional probability density of the input data used during training. LeCun et al. [11] apply back-propagation networks to handwritten digit recognition and introduce three conditions to generate a rejection region. In [12], Singh et al. introduce a rejection filter that classifies only known examples; they determine novel examples by comparing fuzzy clusters of those examples to examples from known classes. In [13], SVMs are applied to novelty detection to learn a function f that is positive on a subset S of the input space and negative outside S.

Several new approaches have been introduced in recent years. In [14], geometric entropy minimization is introduced for anomaly detection. An efficient anomaly detection method using bipartite k-NN graphs is presented in [15]. In [16], an anomaly detection algorithm based on score functions is proposed, in which each point gets scores from its nearest neighbors; this algorithm can be directly applied to novelty detection. For an extensive review of novelty detection and anomaly detection, we refer the reader to [1, 4, 5, 17–19].
2.2 Multi-Instance Multi-Label Learning

Multi-instance multi-label learning (MIML) is a relatively new setting and an active research topic in recent years. Because of its general structure, many real-world problems naturally fit this setting; hence it has wide applications in machine learning, computer vision, and data mining. Zhou et al. [20] apply this setting to scene classification. They revise traditional boosting and SVM algorithms into their MIMLBoost and MIMLSVM, apply these MIML algorithms to scene classification, and demonstrate their advantages by comparison with several multi-instance learning algorithms. Zha et al. [21] propose an integrated MIML approach based on hidden conditional random fields; they apply this framework to image classification and achieve superior performance compared to state-of-the-art algorithms. In [22], Surdeanu et al. propose an MIML approach to relation extraction that jointly models all the instances of a pair of entities in text and all their labels via a graphical model with latent variables.

Chapter 3: PROBLEM FORMULATION

Suppose we are given a collection of labeled bags $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$, where the $i$th bag $X_i \subset \mathcal{X}$ is a set of instances from the feature space $\mathcal{X} \subset \mathbb{R}^d$, and $Y_i$ is a subset of the known label set $\mathcal{Y} = \bigcup_{i=1}^{N} Y_i$. For any label $y_{im} \in Y_i$, there is at least one instance $x_{in} \in X_i$ belonging to this class. We consider the scenario where an instance in $X_i$ may have no label in $Y_i$ related to it, which extends the traditional MIML learning framework. Our goal is to determine, for a given instance $x \in \mathcal{X}$, whether or not it belongs to a known class in $\mathcal{Y}$.

3.1 A Toy Example

To illustrate the intuition behind our general strategy, consider the toy problem shown in Table 3.1. The known label set is {I, II}, and four labeled bags are available. According to the principle that each instance must belong to exactly one class and each bag-level label must have at least one corresponding instance, we conclude that △ is drawn from class I, ○ belongs to class II, and ♢ does not come from the existing classes. ▽ cannot be fully determined from the current data.

Table 3.1: Toy problem with two known classes

#   Bags (Xi)        Labels (Yi)
1   {△, △, ○, ▽}     {I, II}
2   {△, △, ♢, ▽}     {I}
3   {○, ♢, ♢}        {II}
4   {△, △, ○}        {I, II}

To express this observation mathematically, we calculate the rate of co-occurrence of an instance and a label. For example, △ appears together with label I in bags 1, 2, and 4, and both are missing from bag 3; so the co-occurrence rate is p(△, I) = 4/4 = 1. The label II is attached to bags 1, 3, and 4, hence p(△, II) = 2/4 = 1/2. All the other rates can be calculated analogously; Table 3.2 lists them all.

Table 3.2: Co-occurrence rates for the toy problem

      I     II
△     1     1/2
▽     3/4   1/4
○     1/2   1
♢     1/4   1/4

If we classify an instance based on its maximal co-occurrence rate over all classes, with a threshold of 3/4, we reach a result that generally reflects the observations above.

This example inspires a general strategy for detection. We introduce a set of score functions, one per class: for each label $c \in \mathcal{Y}$, we assign a function $f_c$ to class $c$. For an instance from a specific known class, the value of the score function corresponding to that class should be large. If all scores of an instance are below a prescribed threshold, it is not considered to belong to any known class. The decision principle is: if $\max_{c \in \{1, \ldots, |\mathcal{Y}|\}} f_c(x) < \varepsilon$, return 'unknown'; otherwise, return 'known'.
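To make the co-occurrence computation concrete, the following is a minimal Python sketch that reproduces Tables 3.1 and 3.2 and applies the 3/4 threshold. The symbol names tri, inv, circ, and diam (standing for △, ▽, ○, ♢) and the helper function are illustrative only, not part of the proposed method.

```python
# Minimal sketch of the toy co-occurrence computation (Tables 3.1 and 3.2).
def cooccurrence_rate(bags, labels, symbol, label):
    """Fraction of bags in which `symbol` and `label` are both present or both absent."""
    agree = sum((symbol in X) == (label in Y) for X, Y in zip(bags, labels))
    return agree / len(bags)

bags   = [{'tri', 'circ', 'inv'}, {'tri', 'diam', 'inv'},
          {'circ', 'diam'},       {'tri', 'circ'}]
labels = [{'I', 'II'}, {'I'}, {'II'}, {'I', 'II'}]

for s in ('tri', 'inv', 'circ', 'diam'):
    rates = {c: cooccurrence_rate(bags, labels, s, c) for c in ('I', 'II')}
    verdict = 'unknown' if max(rates.values()) < 3/4 else 'known'
    print(s, rates, verdict)
```

Running this reproduces the rates in Table 3.2 and flags ♢ as the only 'unknown' symbol; ▽, whose maximal rate sits exactly at the 3/4 threshold, is retained as 'known', reflecting its ambiguity in the text.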
There are many possible choices for the set of score functions. Generally, the score functions should enable a high true positive rate at a given false positive (Type I error) rate, which can be measured by the area under the curve (AUC) of the ROC. For the properties of ROC and AUC, please refer to Appendix B.

Chapter 4: KERNEL METHODS

4.1 Kernel Based Scoring Functions

We define the score function for class c as

$$f_c(x) = \sum_{x_l \in \bigcup_i X_i} \alpha_{cl}\, k(x, x_l) = \alpha_c^T \mathbf{k}(x) \tag{4.1}$$

where the $X_i$'s are training bags, the $x_l$'s are training instances from the training bags, $k(\cdot, \cdot)$ is the kernel function such that $\mathbf{k}(x) = (k(x, x_1), \ldots, k(x, x_L))^T$, and the $\alpha_{cl}$'s are the components of the weight vector $\alpha_c = (\alpha_{c1}, \ldots, \alpha_{cL})^T$.

We encourage $f_c$ to take positive values on instances in class c and negative values on instances from other classes. Hence, we define the objective function OBJ as

$$\mathrm{OBJ} = \frac{\lambda}{2} \sum_{c=1}^{|\mathcal{Y}|} \alpha_c^T K \alpha_c + \frac{1}{N|\mathcal{Y}|} \sum_{i=1}^{N} \sum_{c=1}^{|\mathcal{Y}|} F_c(X_i) \tag{4.2}$$

where

$$F_c(X_i) = \max\Big\{0,\; 1 - y_{ic} \max_{x_{ij} \in X_i} f_c(x_{ij})\Big\}, \qquad y_{ic} \in \{-1, +1\},$$

λ is a regularization parameter, K is the kernel matrix with (i, j)-th entry $k(x_i, x_j)$ for $x_i, x_j \in \bigcup_k X_k$, and $y_{ic} = +1$ if and only if $Y_i$ contains the label for class c.

In effect, we define an objective function for each class separately and sum over these per-class objectives to construct OBJ. The first term of OBJ controls model complexity. The term $F_c(\cdot)$ can be viewed as a bag-level hinge loss for class c, which is a generalization of the single-instance case. If c is a bag-level label of bag $X_i$, we expect $\max_{x_{ij} \in X_i} f_c(x_{ij})$ to give a high score because at least one instance in $X_i$ is from class c. Other loss functions, such as the rank loss [3], have also been introduced for MIML learning.

Our goal is to minimize the objective function, which is unfortunately non-convex. However, if we fix the term $\max_{x_{ij} \in X_i} f_c(x_{ij})$, i.e., find the support instance $x_{ic} = \operatorname{argmax}_{x_{ij} \in X_i} \alpha_c^T \mathbf{k}(x_{ij})$ and substitute it back into the objective function, the resulting objective $\mathrm{OBJ}^*$ is convex with respect to the $\alpha_c$'s. To solve this convex problem, we deploy the L-BFGS algorithm [23]. The subgradient along $\alpha_c$ used in L-BFGS is computed as

$$\nabla_c = \lambda K \alpha_c - \frac{1}{N|\mathcal{Y}|} \sum_{i=1}^{N} y_{ic}\, \mathbf{k}(x_{ic})\, \mathbf{1}_{\{1 - y_{ic} f_c(x_{ic}) > 0\}} \tag{4.3}$$

Details can be found in Algorithm 1. This descent method can be applied with any choice of kernel function, and in our experience it works very well (it usually converges within 30 steps). Note that many MIML algorithms [3, 24] that attempt to learn instance-level score functions, including the proposed approach, are based on a non-convex objective; consequently, no global optimum is guaranteed. To reduce the effect of random initialization, we rerun the algorithm multiple times with independent random initializations and adopt the result with the smallest value of the objective function.

Algorithm 1 Descent Method
Require: {(X_1, Y_1), (X_2, Y_2), ..., (X_N, Y_N)}, λ, T.
Randomly initialize all α_c's s.t. ∥α_c^1∥ = 1.
for t = 1 to T do
    Set x_ic^t = argmax_{x_ij ∈ X_i} (α_c^t)^T k(x_ij),  1_ic^t = 1{1 − y_ic f_c(x_ic^t) > 0},
        ∇_c^t = λKα_c − (1 / (N|Y|)) Σ_{i=1}^{N} y_ic k(x_ic^t) 1_ic^t.
    Plug {x_ic^t} into OBJ to get a convex surrogate OBJ^{t*}.
    Run L-BFGS with inputs OBJ^{t*} and ∇_c^t to return {α_c^{t+1}} and OBJ^{t+1}.
end for
return {α_c^{T+1}} and OBJ^{T+1}.

4.2 Parameter Tuning

In our experiments, we use the Gaussian kernel, $k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, where $\|\cdot\|$ is the Euclidean norm. The parameter γ controls the bandwidth of the kernel. Hence, a pair of parameters, λ and γ, must be determined for the objective function. During training, we search over a wide range of values for the parameter pair and select the pair whose corresponding $\alpha_c$'s minimize

$$\sum_{i=1}^{N} \sum_{c=1}^{|\mathcal{Y}|} g\Big(y_{ic} \max_{x_{ij} \in X_i} f_c(x_{ij})\Big)$$

where $g(x) = \mathbf{1}_{x<0}$ is the zero-one loss function. Note that $\mathbf{1}_{x<0}$ is a lower bound of the hinge loss $\max\{0, 1-x\}$. At test time, we vary the value of the threshold to generate ROCs; the threshold values are derived from training examples.
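For concreteness, the following is a minimal NumPy sketch of the score function (4.1) and the subgradient (4.3). The variable names are illustrative, and the full method would hand this subgradient and the convex surrogate to an off-the-shelf L-BFGS routine (e.g., scipy.optimize.minimize with method='L-BFGS-B') as in Algorithm 1; this is a sketch under those assumptions, not the exact implementation used in our experiments.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), as in Section 4.2
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def subgradient(alpha_c, K, bag_index, y_c, lam, n_classes):
    """Evaluate (4.3) for one class c.

    alpha_c:   (L,) weight vector for class c
    K:         (L, L) kernel matrix over all L training instances
    bag_index: list of N index arrays, one per bag, into the L instances
    y_c:       (N,) bag-level labels in {-1, +1} for class c
    """
    N = len(bag_index)
    scores = K @ alpha_c                 # f_c evaluated at every training instance
    grad = lam * (K @ alpha_c)           # gradient of the regularization term
    for i, idx in enumerate(bag_index):
        j = idx[np.argmax(scores[idx])]  # support instance x_ic of bag i
        if 1 - y_c[i] * scores[j] > 0:   # hinge loss is active for this bag
            grad -= y_c[i] * K[:, j] / (N * n_classes)
    return grad
```

Each outer iteration of Algorithm 1 would recompute the support instances with the current α_c and then rerun L-BFGS on the resulting convex surrogate.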
The parameter γ controls the bandwidth of the kernel. Hence, there are a pair of parameters λ and γ in the objective function required to be determined. While training, we search in a wide range of values for the parameter pair, and select the pair with corresponding αc ’s that minimizes |Y | N ∑ ∑ i=1 c=1 g(yic max fc (xij )) xij ∈Xi where g(x) = 1x<0 is the zero-one loss function. Note that 1x<0 is a lower bound of the hinge loss max{0, 1 − x}. We vary the value of threshold to generate ROCs while testing. The values of threshold 14 are derived from training examples. 15 Chapter 5: EXPERIMENTAL RESULTS In this chapter, we provide a number of experimental results based on both synthetic data and real-world data to show the effectiveness of our algorithm. Additionally, we present a comparison to one-class SVM, a notable anomaly detection algorithm. 5.1 MNIST Handwritten Digits Dataset We generated the synthetic data based on the MNIST handwritten digits data set1 . Each image in the data set is a 28 by 28 bitmap, i.e., a vector of 784 dimensions. By using PCA, we reduced the dimension of instances to 20. number bags labels 1 ‘1’ 2 ‘0’, ‘1’ 3 ‘2’, ‘3’ 4 ‘0’, ‘1’ 5 ‘0’, ‘2’ Table 5.1: Bag examples for the handwritten digits data. We take the first four digits ‘0’, ‘1’, ‘2’, ‘3’ as known classes, i.e., Y ={‘0’, ‘1’, ‘2’, ‘3’ }. In each bag, some instances are without associated labels. For example, in bag 1 examples for ‘5’ and ‘9’ are considered from unknown classes. We created training and testing bags from the MNIST instances. Some examples for 1 http://www.cs.nyu.edu/~roweis/data.html 16 handwritten digits bags are shown in Table 5.1. Two processes for generating bags are listed in Algorithm 2 and Algorithm 3. The only difference between these two procedures is that Algorithm 3 rules out the possibility of a label set for a bag being empty, i.e., a bag including purely novel examples. For Dirichlet process used in our simulation, we assigned relatively small concentration parameters β = (β1 , β2 , . . . , β10 ) to the Dirichlet distribution in order to encourage a sparse label set for a bag, which is common in realworld scenarios. We set all βi = 0.1 and the bag size M = 20. Typical examples of bags generated from Dirichlet distribution are shown in Table 5.1. ‘0’ 0 6 0 0 0 ‘1’ 1 0 17 0 0 ‘2’ 0 4 0 1 15 ‘3’ 0 8 2 0 0 ‘4’ 5 0 0 2 0 ‘5’ 5 0 0 16 0 ‘6’ 2 1 1 1 0 ‘7’ 3 0 0 0 5 ‘8’ 4 0 0 0 0 ‘9’ 0 1 0 0 0 Table 5.2: Examples for numbers of each digit in 5 bags when each component of β is 0.1. The bag size is set to be 20. Algorithm 2 Bag generation procedure for handwritten digits data Require: N , M , Y , β. for i = 1 to N do Draw M instances {xij } according to the proportion given by Dirichlet (β) distribution. ′ ′ Extract labels from xij ’s to form Yi and set Yi = Y ∩ Yi . end for We provided our method with bags generated in two different ways: 1. Generate both training and testing bags according to Algorithm 2. 2. Generate training bags according to Algorithm 3 while generate testing bags by applying Algorithm 2. 17 Algorithm 3 Bag generation procedure with filtration for handwritten digits data. Require: N , M , Y , β. for i = 1 to N do Set Yi = ∅. while Yi == ∅ do Draw M instances {xij } according to the proportion given by Dirichlet (β) distribution. ′ ′ Extract labels from xij ’s to form Yi and set Yi = Y ∩ Yi . end while end for In our experiments, we consider various sizes of known label sets and different combinations of labels in these two settings. 
Two typical ROC examples from the two settings are shown in Figure 5.1. Table 5.3 shows the average AUCs over multiple runs for the first setting. We observe that the average AUCs are all above 0.85 for known label sets of size 4, and all above 0.8 for known label sets of size 8. The results are fairly stable across different combinations of labels, which demonstrates the effectiveness of our algorithm.

Table 5.4 shows the average AUCs for the setting in which training does not include bags with an empty label set. The label sets in the two tables are the same. The results in the two tables are comparable, but those in Table 5.3 are never worse. This suggests that it is beneficial to include bags with an empty label set; the reason could be that those bags contain purely novel examples, and hence training on them is very reliable.

To understand to what extent adding novel instances helps, we gradually increase the percentage of novel instances in the training and testing data. From Figure 5.2, we observe that as the ratio of novel instances increases from 0.1 to 0.5, the performance first improves until the ratio reaches 0.3 and then levels off. This trend suggests that increasing the number of novel examples toward a balanced dataset increases accuracy, although the increase is only noticeable over a small range of ratios.

Table 5.3: Average AUCs for handwritten digits data. Y is the known label set. Training bags and testing bags are both generated according to Algorithm 2, i.e., without bag filtration.

Y                      AUC    Y                                      AUC
{'0','1','3','7'}      0.89   {'0','1','2','3','4','5','6','7'}      0.85
{'2','4','7','8'}      0.87   {'2','3','4','5','6','7','8','9'}      0.88
{'2','5','6','7'}      0.91   {'0','1','4','5','6','7','8','9'}      0.84
{'3','5','7','9'}      0.85   {'0','1','2','3','6','7','8','9'}      0.85
{'3','6','8','9'}      0.89   {'0','1','2','3','4','5','8','9'}      0.83

Table 5.4: Average AUCs for handwritten digits data. Y is the known label set. Training bags are generated according to Algorithm 3, i.e., with bag filtration, while testing bags are generated by Algorithm 2, i.e., without bag filtration.

Y                      AUC    Y                                      AUC
{'0','1','3','7'}      0.86   {'0','1','2','3','4','5','6','7'}      0.85
{'2','4','7','8'}      0.86   {'2','3','4','5','6','7','8','9'}      0.84
{'2','5','6','7'}      0.88   {'0','1','4','5','6','7','8','9'}      0.82
{'3','5','7','9'}      0.83   {'0','1','2','3','6','7','8','9'}      0.84
{'3','6','8','9'}      0.86   {'0','1','2','3','4','5','8','9'}      0.80

5.2 HJA Birdsong Dataset

We tested our algorithm on a real-world dataset, the HJA birdsong dataset (available online at http://web.engr.oregonstate.edu/~briggsf/kdd2012datasets/hja_birdsong/), which has been used in [25, 26]. This dataset consists of 548 bags, each of which contains several 38-dimensional instances. The bag size, i.e., the number of instances in a bag, varies from 1 to 26, with an average of approximately 9. The dataset includes 4998 instances from 13 species. Species names and instance counts are listed in Table 5.5. Each species corresponds to a class in the complete label set {1, 2, ..., 13}.

We took subsets of the complete label set as the known label set and conducted experiments with various choices of the known label set; we intentionally made each species appear at least once across those known sets. Table 5.6 shows the average AUCs for the different known label sets. We observe that almost all of the AUC values are at least 0.85, and some reach 0.90.
The results are quite stable across different label settings, despite the imbalance in the instance populations of the species. These results illustrate the potential of the approach as a utility for novel species discovery.

Table 5.5: Names of bird species and the total number of instances for each species. Each species corresponds to one class.

Class   Species                      No. of Instances
1       Brown Creeper                602
2       Winter Wren                  810
3       Pacific-slope Flycatcher     501
4       Red-breasted Nuthatch        494
5       Dark-eyed Junco              82
6       Olive-sided Flycatcher       277
7       Hermit Thrush                32
8       Chestnut-backed Chickadee    345
9       Varied Thrush                139
10      Hermit Warbler               120
11      Swainson's Thrush            190
12      Hammond's Flycatcher         1280
13      Western Tanager              126

Table 5.6: Average AUCs for birdsong data. Y is the known label set.

Y                 AUC    Y                            AUC
{1,2,4,8}         0.90   {1,2,3,4,5,6,7,8}            0.89
{3,5,7,9}         0.85   {3,4,5,6,7,8,9,10}           0.85
{4,6,8,10}        0.88   {5,6,7,8,9,10,11,12}         0.89
{5,7,9,11}        0.90   {1,7,8,9,10,11,12,13}        0.84
{6,10,12,13}      0.89   {1,2,3,9,10,11,12,13}        0.85

5.3 Comparison with One-Class SVM

Our algorithm addresses a detection problem in the MIML setting, which differs from the traditional setting for anomaly detection, and we argue that traditional anomaly detection algorithms cannot be directly applied to our problem. For comparison, we adopt one-class SVM [27–29], a well-known algorithm for anomaly detection; for a detailed introduction, please refer to Appendix A. To apply one-class SVM, we construct normal-class training data consisting of examples from the known label set. The parameter ν is varied from 0 to 1 with step size 0.02 to generate ROCs. The Gaussian kernel is used for one-class SVM. We search for the kernel parameter γ over a wide range and select the best one for one-class SVM post hoc. We grant this unfair advantage to one-class SVM for two reasons: (i) it is unclear how to optimize the parameter in the absence of novel instances; (ii) we would like to illustrate that, even given such an unfair advantage, one-class SVM cannot outperform our algorithm.

Tables 5.7 and 5.8 show the average AUCs for the handwritten digits data and the birdsong data, respectively. Compared with Tables 5.3 and 5.6, the proposed algorithm outperforms one-class SVM in terms of AUC, not only in absolute value but also in stability. This also demonstrates that training with unlabeled instances is beneficial to the detection.

Table 5.7: Average AUCs for the handwritten digits data by applying one-class SVM with Gaussian kernel. Y is the known label set.

Y                      AUC    Y                                      AUC
{'0','1','3','7'}      0.66   {'0','1','2','3','4','5','6','7'}      0.59
{'2','4','7','8'}      0.66   {'2','3','4','5','6','7','8','9'}      0.63
{'2','5','6','7'}      0.57   {'0','1','4','5','6','7','8','9'}      0.68
{'3','5','7','9'}      0.65   {'0','1','2','3','6','7','8','9'}      0.62
{'3','6','8','9'}      0.63   {'0','1','2','3','4','5','8','9'}      0.65

Table 5.8: Average AUCs for the birdsong data by applying one-class SVM. Y is the known label set.

Y                 AUC    Y                            AUC
{1,2,4,8}         0.78   {1,2,3,4,5,6,7,8}            0.85
{3,5,7,9}         0.79   {3,4,5,6,7,8,9,10}           0.82
{4,6,8,10}        0.82   {5,6,7,8,9,10,11,12}         0.75
{5,7,9,11}        0.73   {1,7,8,9,10,11,12,13}        0.70
{6,10,12,13}      0.78   {1,2,3,9,10,11,12,13}        0.60

Figure 5.1: Typical examples of ROCs from the handwritten digit data. Subfigure (a) shows a ROC example from the first setting and subfigure (b) gives an example from the second setting.
Figure 5.2: Variation of AUC as the ratio of novel instances changes. The error bars stand for standard deviation.

Chapter 6: CONCLUSIONS

6.1 Summary

In this report, we proposed a new problem, novelty detection in the MIML setting, and offered a framework based on score functions to solve it. A large number of simulations show that our algorithm works well not only on synthetic data but also on real-world data, and the comparison with one-class SVM shows the superiority of our method. We demonstrate that the presence of unlabeled examples in the training set is useful for detecting new-class examples at test time. We also highlight the advantage of the MIML setting for novelty detection: even though positive examples of novelty are never directly labeled, their presence provides a clear advantage over methods that rely on data containing no novel-class examples.

6.2 Contributions

To the best of our knowledge, novelty detection in the MIML setting has not been investigated before. Our main contributions are:

• We pose a new problem – novelty detection in the MIML setting.
• We produce a framework based on score functions and provide a practical algorithm to solve the problem.
• We illustrate the efficacy of our method not only on synthetic handwritten digit data, but also on a real-world MIML bioacoustics dataset.

6.3 Publications

The following is a list of publications resulting from the author's MS research.

1. Qi Lou, Raviv Raich, Forrest Briggs, Xiaoli Z. Fern, "Novelty Detection Under Multi-Label Multi-Instance Framework", accepted to IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013.
2. Forrest Briggs, Xiaoli Fern, Raviv Raich, Qi Lou, "Instance Annotation for Multi-instance Multi-Label Learning", accepted to ACM Transactions on Knowledge Discovery from Data (TKDD), 2013.

6.4 Future Work

There are many related problems that call for further investigation. One is how to incorporate bag-level label information at detection time. In our setting, we only use bag-level information during training, not during the subsequent detection, since such information may not be available in that phase. However, if bag-level labels are always available, we may use this information in the detection phase to rule out some obvious mistakes by the classifier, which would possibly improve the performance of our algorithm.

Theoretical work, such as a PAC-learnability analysis, might be necessary to guarantee the feasibility of novelty detection in the MIML setting. Currently, we are not sure whether bag-level information suffices for instance-level detection, or under what conditions such detection is possible.

We also propose to explore applications of our approach to other real-world datasets naturally equipped with the MIML structure. For example, in computer vision, images are often provided with tags that indicate the presence of objects in the images. By extracting bag-level information from each image (e.g., using a patch representation), a collection of images can be represented in the MIML setting. We suggest using our method interactively with active learning approaches to classify whether an instance (e.g., a patch or a region) is of a previously seen type or a new one.

Bibliography

[1] Markos Markou and Sameer Singh, "Novelty detection: A review – part 1: Statistical approaches," Signal Processing, vol. 83, pp. 2481–2497, 2003.
[2] Zhi-Hua Zhou, Min-Ling Zhang, Sheng-Jun Huang, and Yu-Feng Li, "Multi-instance multi-label learning," Artificial Intelligence, vol. 176, no. 1, pp. 2291–2320, 2012.
[3] Forrest Briggs, Xiaoli Z. Fern, and Raviv Raich, "Rank-loss support instance machines for MIML instance annotation," in KDD, 2012.
[4] Varun Chandola, Arindam Banerjee, and Vipin Kumar, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, no. 3, pp. 15:1–15:58, July 2009.
[5] Markos Markou and Sameer Singh, "Novelty detection: A review – part 2: Neural network based approaches," Signal Processing, vol. 83, pp. 2499–2521, 2003.
[6] G. Manson, G. Pierce, K. Worden, T. Monnier, P. Guy, and K. Atherton, "Long term stability of normal condition data for novelty detection," in Proceedings of the 7th International Symposium on Smart Structures and Materials, California, 2000.
[7] Jorma Laurikkala, Martti Juhola, and Erna Kentala, "Informal identification of outliers in medical data," 2000.
[8] M. J. Desforges, P. J. Jacob, and J. E. Cooper, "Applications of probability density estimation to the detection of abnormal conditions in engineering," in Proceedings of the Institute of Mechanical Engineers, 1998.
[9] Martin E. Hellman, "The nearest neighbor classification rule with a reject option," IEEE Transactions on Systems Science and Cybernetics, vol. 6, no. 3, pp. 179–185, 1970.
[10] Chris M. Bishop, "Novelty detection and neural network validation," 1994.
[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[12] Sameer Singh and Markos Markou, "An approach to novelty detection applied to the classification of image regions," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 396–407, Apr. 2004.
[13] Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt, "Support vector method for novelty detection," in NIPS, 1999, pp. 582–588.
[14] Alfred O. Hero, "Geometric entropy minimization (GEM) for anomaly detection and localization," in NIPS, 2006, pp. 585–592, MIT Press.
[15] Kumar Sricharan and Alfred O. Hero, "Efficient anomaly detection using bipartite k-NN graphs," in NIPS, 2011, pp. 478–486.
[16] Manqi Zhao and Venkatesh Saligrama, "Anomaly detection with score functions based on nearest neighbor graphs," in NIPS, 2009, pp. 2250–2258.
[17] Victoria Hodge and Jim Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, Oct. 2004.
[18] Malik Agyemang, Ken Barker, and Rada Alhajj, "A comprehensive survey of numeric and symbolic outlier mining techniques," Intelligent Data Analysis, vol. 10, no. 6, pp. 521–538, Dec. 2006.
[19] Animesh Patcha and Jung-Min Park, "An overview of anomaly detection techniques: Existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, Aug. 2007.
[20] Zhi-Hua Zhou and Min-Ling Zhang, "Multi-instance multi-label learning with application to scene classification," in Advances in Neural Information Processing Systems 19, 2007.
[21] Zheng-Jun Zha, Xian-Sheng Hua, Tao Mei, Jingdong Wang, Guo-Jun Qi, and Zengfu Wang, "Joint multi-label multi-instance learning for image classification," in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, Alaska, USA, 2008.
[22] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning, "Multi-instance multi-label learning for relation extraction," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12), 2012, pp. 455–465.
[23] Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel, "Representations of quasi-Newton matrices and their use in limited memory methods," 1994.
[24] Oksana Yakhnenko and Vasant Honavar, "Multi-instance multi-label learning for image classification with large vocabularies," in Proceedings of the British Machine Vision Conference, 2011, pp. 59.1–59.12, BMVA Press.
[25] Forrest Briggs, Xiaoli Z. Fern, Raviv Raich, and Qi Lou, "Instance annotation for multi-instance multi-label learning," Transactions on Knowledge Discovery from Data (TKDD), 2012.
[26] Li-Ping Liu and Thomas G. Dietterich, "A conditional multinomial mixture model for superset label learning," in NIPS, 2012, pp. 557–565.
[27] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson, "Estimating the support of a high-dimensional distribution," 1999.
[28] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.
[29] Larry M. Manevitz, Malik Yousef, Nello Cristianini, John Shawe-Taylor, and Bob Williamson, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, pp. 139–154, 2001.
[30] Andrew P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, pp. 1145–1159, 1997.
[31] Thomas A. Lasko, Jui G. Bhagwat, Kelly H. Zou, and Lucila Ohno-Machado, "The use of receiver operating characteristic curves in biomedical informatics," Journal of Biomedical Informatics, vol. 38, no. 5, pp. 404–415, Oct. 2005.
[32] M. H. Zweig and G. Campbell, "Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine," Clinical Chemistry, vol. 39, no. 4, pp. 561–577, Apr. 1993.

APPENDICES

Appendix A: One-class SVM

To make this report self-contained, we provide the details of one-class SVM here, mainly based on [27, 29]. We suppose that samples are generated from an underlying distribution in the feature space. The general idea of one-class SVM is to find a subset S of the feature space such that a new sample lies outside this region with an a priori specified probability ν ∈ (0, 1); we expect S to be as small as possible. A classification function f takes the value +1 in S and −1 outside. This differs from the two-class problem in that only one class of examples is available in the training phase. Practically, for a given dataset {x_1, x_2, ..., x_n}, the algorithm proceeds as follows:

• First, it maps the data into a feature space via a feature map Φ(·).
• Second, it separates the data from the origin using a hyperplane given by ⟨w, Φ(x)⟩ = ρ.
• Third, it penalizes outliers by adding slack variables ξ_i.
• Last, it trades off between model complexity and training errors via ν.

Mathematically, it suffices to solve the following optimization problem:

$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{n\nu} \sum_{i=1}^{n} \xi_i - \rho$$
$$\text{s.t.}\quad \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, n$$

This formulation can be kernelized just like the regular SVM, as can be seen from its dual formulation. One can use LIBSVM [28] to solve this problem, as we did in our implementation. The decision function f is given by $f(x) = \operatorname{sgn}(\langle w, \Phi(x) \rangle - \rho)$.
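As a sketch of this baseline, the following uses scikit-learn's OneClassSVM, which wraps LIBSVM, the solver used in our implementation. The synthetic data and the γ value are placeholders, and the ν sweep mirrors Section 5.3; this illustrates the mechanics rather than reproducing our experiments.

```python
import numpy as np
from sklearn.svm import OneClassSVM  # LIBSVM-backed one-class SVM

rng = np.random.default_rng(0)
X_known = rng.normal(size=(200, 20))                      # stand-in for known-class instances
X_test = np.vstack([rng.normal(size=(50, 20)),            # known-class test points
                    rng.normal(loc=4.0, size=(50, 20))])  # stand-in novel test points
is_novel = np.r_[np.zeros(50, bool), np.ones(50, bool)]

for nu in np.arange(0.02, 1.0, 0.02):                     # nu sweep as in Section 5.3
    clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=nu).fit(X_known)
    pred_novel = clf.predict(X_test) == -1                # -1 = outside the region S
    tpr = (pred_novel & is_novel).sum() / is_novel.sum()
    fpr = (pred_novel & ~is_novel).sum() / (~is_novel).sum()
```

Each ν yields one (FPR, TPR) operating point; collecting them over the sweep traces the baseline's ROC curve.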
Appendix B: ROC and AUC

The receiver operating characteristic (ROC) is a standard indicator of the performance of a binary classifier as its threshold varies. It plots the true positive rate (TPR) versus the false positive rate (FPR) at various threshold settings. It provides a powerful tool for model selection and decision making, and hence is widely used in many fields such as machine learning [30], biomedical informatics [31], and clinical medicine [32]. We list the basic concepts of ROC analysis here. In a binary classification problem, the examples are labeled as positive (P) or negative (N), and the classifier also assigns a predicted label to each example. Hence:

• TP: true positive. The predicted label is positive and the actual label is also positive.
• TN: true negative. The predicted label is negative and the actual label is also negative.
• FP: false positive. The predicted label is positive but the actual label is negative.
• FN: false negative. The predicted label is negative but the actual label is positive.

The TPR and FPR are calculated by

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

From the above formulas, we see that the TPR is the fraction of all positive examples that are correctly predicted as positive, and the FPR is the fraction of all negative examples that are wrongly predicted as positive. For a detector, the goal is to return a high TPR at a low FPR. This is directly related to the area under a ROC curve, i.e., its AUC: the closer a ROC curve is to the ideal, the closer its AUC is to 1, which is the supremum. It is not straightforward to compare two ROC curves directly, but comparing their AUCs is simple; hence the AUC of a ROC is a very useful indicator. Many numerical methods can be used to calculate the AUC. In MATLAB, one can invoke the 'trapz' function (http://www.mathworks.com/help/matlab/ref/trapz.html), which applies the trapezoidal method.
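To close, here is a minimal sketch of the TPR/FPR and AUC computation described above, using NumPy's counterpart of MATLAB's trapz; the scores and labels are illustrative.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points and AUC; labels: 1 = positive, 0 = negative.
    Higher scores are taken to indicate the positive class."""
    order = np.argsort(-scores)          # sweep the threshold from high to low
    y = labels[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (1 - y).sum()))
    return fpr, tpr, np.trapz(tpr, fpr)  # trapezoidal rule, like MATLAB's trapz

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
fpr, tpr, auc = roc_auc(scores, labels)
print(auc)  # 0.8125 on this toy data
```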