Ambiguity-Based Multiclass Active Learning
Ran Wang, Member, IEEE, Chi-Yin Chow, Member, IEEE, and Sam Kwong, Fellow, IEEE

Abstract—Most existing works on active learning (AL) focus on binary classification problems, which limits their applications in various real-world scenarios. One solution to multiclass AL (MAL) is evaluating the informativeness of unlabeled samples by an uncertainty model, and selecting the most uncertain one for query. In this paper, an ambiguity-based strategy is proposed to tackle this problem by applying possibility approach. First, the possibilistic memberships of unlabeled samples in the multiple classes are calculated from the one-against-all (OAA)-based support vector machine (SVM) model. Then, by employing fuzzy logic operators, these memberships are aggregated into a new concept named k-order ambiguity, which estimates the risk of labeling a sample among k classes. Afterwards, the k-order ambiguities are used to form an overall ambiguity measure to evaluate the uncertainty of the unlabeled samples. Finally, the sample with the maximum ambiguity is selected for query, and a new MAL strategy is developed. Experiments demonstrate the feasibility and effectiveness of the proposed method.

Index Terms—Active learning, ambiguity, fuzzy sets and fuzzy logic, possibility approach, multiclass.

This work is partially supported by the Hong Kong RGC General Research Fund under Grant 9042038 (CityU 11205314), the National Natural Science Foundation of China under Grant 61402460, and the China Postdoctoral Science Foundation under Grant 2015M572386.
R. Wang is with the Department of Computer Science, City University of Hong Kong, Hong Kong, and also with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: ranwang3-c@my.cityu.edu.hk; wangran@siat.ac.cn).
C.-Y. Chow and S. Kwong are with the Department of Computer Science, City University of Hong Kong, Hong Kong (e-mail: chiychow@cityu.edu.hk; CSSAMK@cityu.edu.hk).

I. INTRODUCTION

Active learning (AL) [1], known as a revised supervised learning scheme, adopts a selective sampling manner to collect a sufficiently large training set. It iteratively selects informative unlabeled samples for query, and constructs a high-performance classifier by labeling as few samples as possible. A widely used base classifier in AL is the support vector machine (SVM) [2], which is a binary classification technique based on statistical learning theory. Under binary settings, many successful AL strategies have been proposed for SVM, such as uncertainty reduction [3], [4], version space (VS) reduction [5], minimum expected model loss [6], etc. However, extending these strategies to multiclass problems is still a challenging issue due to the following two reasons.
• Traditional SVMs are binary classifiers. In order to realize multiclass AL (MAL), it is necessary to construct an effective multiclass SVM model.
• Designing a sample selection criterion for multiple classes is much more complicated than for two classes. For instance, the size of the VS for a binary SVM is easy to calculate, while the size of the VS for multiple SVMs is hard to define.
Existing SVM-based multiclass models, such as one-against-all (OAA) [7] or one-against-one (OAO) [8], decompose the multiclass problem into a set of binary problems. These models evaluate the informativeness of unlabeled samples by aggregating the output decision values of the binary SVMs, or by estimating the class probabilities of the samples.

Possibility theory [9], as an extension of fuzzy sets and fuzzy logic [10], is a commonly used technique for dealing with vagueness and imprecision in the given information. It has great potential for solving AL problems, especially for SVM under multiclass environments. First, possibility approach is able to evaluate the uncertainty of unlabeled samples by aggregating a set of class memberships, which is intrinsically compatible with MAL. Besides, the memberships in possibility theory can be independent; under this condition, an SVM-based model can compute the memberships by decomposing a multiclass problem into a set of binary problems, one for each class. Second, possibility approach is based on an ordering structure rather than an additive structure. This feature makes it less rigorous in measuring the unlabeled samples for an SVM-based model. For instance, in probability approach, one has to consider the pairwise relations of all the classes to satisfy the additive property.
However, possibility approach can relax the additive property and enable SVM to compute the memberships in a simpler way.

To the best of our knowledge, the application of possibility approach to AL has not been investigated. In this paper, a new pool-based MAL strategy is proposed along this line. First, the possibilistic memberships of unlabeled samples are calculated from the OAA-based SVM model. Then, fuzzy logic operators [11], [12], [13] are employed to aggregate the memberships into a new concept named k-order ambiguity, which estimates the risk of labeling a sample among k classes. Finally, a new uncertainty measurement is proposed by integrating the k-order ambiguities. It is noteworthy that possibility approach does not differ from probability approach for a binary SVM, since the positive and negative memberships are always complementary. However, possibility approach is more flexible for multiclass SVM, since it defines an ordering structure with independent memberships.

The remainder of this paper is organized as follows. In Section II, we present some related works. In Section III, we design the ambiguity measure, prove its basic properties, and then establish the ambiguity-based multiclass active learning strategy. In Section IV, we conduct extensive experimental comparisons to show the feasibility and effectiveness of the proposed method. Finally, conclusions and future work directions are given in Section V.

II. BACKGROUNDS AND RELATED WORKS

Researchers have made some efforts to realize SVM-based MAL. A number of works adopt the OAA approach to construct the base models [5], [14], [15], [16], and design the sample selection criterion by aggregating the output decision values of the binary SVMs. Specifically, Tong [5] proposed to evaluate the uncertainty of an unlabeled sample by aggregating its distances to all the SVM hyperplanes. Later, he proposed to evaluate the model loss by aggregating the VS areas of the binary SVMs after having queried an unlabeled sample and received a certain label. This method was also discussed in [15] and [16], where a more effective approximation of the VS areas was designed. Moreover, Hospedales et al. [14] proposed to evaluate the unlabeled samples by both generative and discriminative models, in order to achieve both accurate prediction and class discovery. In addition to the OAA approach, the OAO approach is also effective for constructing multiclass SVM models. For instance, Joshi et al.
[17] proposed a scalable AL strategy for multiclass image classification, which estimates the pairwise probabilities of the images by the OAO approach and selects the one with the highest value-of-information. This work also stated that entropy might not be a good sample selection criterion, since the entropy value is highly affected by the probabilities of unimportant classes. Thus, the best vs. second best (BvSB) method was developed, which only makes use of the two largest probabilities.

Possibility approach, different from the above techniques, is an uncertainty analysis tool with imprecise probabilities. It is driven by the minimal specificity principle, which states that any hypothesis not known to be impossible cannot be ruled out. Given a C-class problem, assume μ_i(x) is the membership of sample x in the i-th class (i = 1, ..., C). The memberships are said to be possibilistic/fuzzy if μ_i(x) ∈ [0, 1], and further probabilistic if \sum_{i=1}^{C} μ_i(x) = 1. In the context of AL, if the memberships are possibilistic, then
1) μ_i(x) = 0 means that class i is rejected as impossible for x;
2) μ_i(x) = 1 means that class i is totally possible for x;
3) at least one class is totally possible for x, i.e., max_{i=1,...,C} {μ_i(x)} = 1.
With a normalisation process, condition 3 can be modified to max_{i=1,...,C} {μ_i(x)} ≠ 0.

There are two schemes to handle possibilistic memberships: 1) transform them to probabilities and apply probability approaches; or 2) aggregate them by fuzzy logic operators. It is stated in [18] that transforming possibilities to probabilities, or conversely, can be useful in many cases. However, they are not equivalent representations: probabilities are based on an additive structure, while possibilities are based on an ordering structure and are more explicit in handling imprecision. Thus, we only focus on the second scheme in this paper.

On the other hand, Wang et al. [19] proposed the concept of classification ambiguity to measure the non-specificity of a set, and applied it to the induction of fuzzy decision trees. Given a set R with a number of labeled samples, its classification ambiguity is defined as:

\mathrm{Ambiguity}(R) = \sum_{i=1}^{C} (p^*_i - p^*_{i+1}) \log i,    (1)

where (p_1, ..., p_C) is the class frequency in R, and (p^*_1, ..., p^*_C, p^*_{C+1}) is the normalisation of (p_1, ..., p_C, 0) in descending order, i.e., 1 = p^*_1 ≥ ... ≥ p^*_{C+1} = 0. Later, Wang et al. [20] applied the same measure to the induction of the extreme learning machine tree, and used it for attribute selection during the induction process. It is noteworthy that both [19] and [20] applied the ambiguity measure to a set of probabilities. However, applying this concept to a set of possibilities might be more effective. Besides, other than measuring the non-specificity of a set, it can also be used to measure the uncertainty of an unlabeled sample.

Motivated by the above statements, in this paper, we apply possibility approach to MAL and develop an ambiguity-based strategy for SVM.
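As a concrete illustration of Eq. (1), the following minimal sketch (written in Python purely for illustration; the function name and the use of the natural logarithm are our own assumptions, since the paper does not specify the log base) mirrors the normalisation and weighting described above:

```python
import numpy as np

def classification_ambiguity(class_counts):
    """Classification ambiguity of a labeled set R, Eq. (1) in [19].

    class_counts: per-class frequencies (p_1, ..., p_C) of R. A trailing 0 is
    appended, the values are sorted in descending order and normalised by the
    largest one, so that 1 = p*_1 >= ... >= p*_{C+1} = 0.
    """
    p = np.append(np.asarray(class_counts, dtype=float), 0.0)
    p_star = np.sort(p)[::-1]
    p_star = p_star / p_star[0]        # assumes at least one nonzero count
    i = np.arange(1, len(p_star))      # i = 1, ..., C
    return float(np.sum((p_star[:-1] - p_star[1:]) * np.log(i)))

print(classification_ambiguity([10, 0, 0]))   # 0.0   (pure set: no ambiguity)
print(classification_ambiguity([5, 5, 5]))    # log 3 (uniform set: maximal ambiguity)
```

A pure set (one dominant class) has ambiguity 0, while a uniform set attains the maximum log C, which matches the non-specificity interpretation given in [19].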
III. AMBIGUITY-BASED MULTICLASS ACTIVE LEARNING

A. Ambiguity Measure

In fuzzy theory, fuzzy logic provides functions for aggregating fuzzy sets and fuzzy relations. These functions are called aggregation operators. In MAL, if we treat the possibilistic memberships of unlabeled samples as fuzzy sets, then the aggregation operators in fuzzy logic can be used to evaluate the informativeness of these samples.

Given a set of possibilistic memberships {μ_1, ..., μ_C} where μ_i ∈ [0, 1], i = 1, ..., C, Frélicot and Mascarilla [11] proposed the fuzzy OR-2 aggregation operator ▽^(2) as:

\triangledown^{(2)}_{i=1,\dots,C}\,\mu_i = \triangledown_{i=1,\dots,C}\big(\mu_i \,\triangle\, \triangledown_{j \neq i}\,\mu_j\big),    (2)

where (△, ▽) is a dual pair of t-norm and t-conorm. It is demonstrated in [12] that when the standard t-norm a △ b = min{a, b} and the standard t-conorm a ▽ b = max{a, b} are selected, Eq. (2) has the property \triangledown^{(2)}_{i=1,\dots,C}\,\mu_i = μ'_2, where μ'_2 is the second largest membership among {μ_1, ..., μ_C}. Based on this property, they proposed a specific fuzzy OR-2 operator, i.e., \triangledown^{(2)}_{i=1,\dots,C}\,\mu_i = \triangle_{i=1,\dots,C}\,\triangledown_{j \neq i}\,\mu_j, and proved that it is continuous, monotonic, and symmetric under some boundary conditions.

Mascarilla et al. [13] further extended Eq. (2) to a generalized k-order fuzzy OR operator. Let C = {1, ..., C}, P(C) be the power set of C, and P_k = {A ∈ P(C) : |A| = k}, where |A| is the cardinality of subset A. The k-order fuzzy OR operator ▽^(k) is defined as:

\triangledown^{(k)}_{i=1,\dots,C}\,\mu_i = \triangle_{A \in P_{k-1}}\,\triangledown_{j \in C \setminus A}\,\mu_j,    (3)

where (△, ▽) is a dual pair of t-norm and t-conorm, and k ∈ {2, ..., C}. By theoretical proof, they also demonstrated that when the standard t-norm a △ b = min{a, b} and the standard t-conorm a ▽ b = max{a, b} are selected, Eq. (3) has the property \triangledown^{(k)}_{i=1,\dots,C}\,\mu_i = μ'_k, where μ'_k is the k-th largest membership among {μ_1, ..., μ_C}. It is noteworthy that there are various combinations of aggregation operators and t-norms, but the study on them is not the focus of this work. For simplicity, we adopt the standard t-norm and standard t-conorm with the aforementioned aggregation operators.
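As a small illustration of Eq. (3), the sketch below (illustrative Python, not part of the paper) enumerates the subsets A ∈ P_{k-1} with the standard pair (min, max) and checks the stated property that the k-order fuzzy OR returns the k-th largest membership; with (min, max) one could of course obtain the same value by simply sorting the memberships.

```python
from itertools import combinations

def k_order_fuzzy_or(mu, k):
    """k-order fuzzy OR of Eq. (3) with the standard pair (min, max):
    the minimum over all subsets A of size k-1 of max_{j in C\\A} mu_j."""
    idx = range(len(mu))
    return min(max(mu[j] for j in idx if j not in A)
               for A in map(set, combinations(idx, k - 1)))

mu = [0.9, 0.2, 0.7, 0.4]
for k in range(2, len(mu) + 1):
    # With (min, max) the operator simply returns the k-th largest membership.
    assert k_order_fuzzy_or(mu, k) == sorted(mu, reverse=True)[k - 1]
print([k_order_fuzzy_or(mu, k) for k in range(2, 5)])   # [0.7, 0.4, 0.2]
```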
In decision theory, ambiguity indicates the risk of classifying a sample based on its memberships, and a larger membership has a higher influence on the risk. Obviously, the larger the ambiguity, the higher the difficulty in classifying the sample. In this section, we design an ambiguity measure to achieve this purpose. For the sake of clarity, we start from a set of axioms. Given a sample x with a set of possibilistic memberships {μ_1(x), ..., μ_C(x)}, where μ_i(x) ∈ [0, 1] (i = 1, ..., C) is the possibility of x in the i-th class, the ambiguity measure on x, denoted as A_C(x) = A(μ_1(x), ..., μ_C(x)), is a continuous mapping [0, 1]^C → [0, a] (where a ∈ R^+) satisfying the following three axioms:
1) Symmetry: A_C(x) is a symmetric function of μ_1(x), ..., μ_C(x).
2) Monotonicity: A_C(x) is monotonically decreasing in max{μ_i(x)}, and is monotonically increasing in the other μ_i(x).
3) Boundary condition: A_C(x) = a when μ_1(x) = μ_2(x) = ... = μ_C(x) ≠ 0; A_C(x) = 0 when max{μ_i(x)} ≠ 0 and μ_i(x) = 0 otherwise.
According to Axiom 1, the ambiguity value is unchanged under any permutation of the memberships. According to Axiom 2, increasing the greatest membership and decreasing the other memberships lead to a smaller ambiguity, i.e., the classification risk on a sample is lower when the differences between its greatest membership and the other memberships are larger. According to Axiom 3, the classification of a sample is most difficult when it belongs equally to all the classes, and is easiest when only the greatest membership is nonzero.

Frélicot et al. [12] applied the fuzzy OR-2 operator to the classification problem and proposed the 2-order ambiguity to evaluate the classification risk of x, which is defined as:

A^{(2)}_C(x) = \frac{\triangledown^{(2)}_{i=1,\dots,C}\,\mu_i(x)}{\triangledown_{i=1,\dots,C}\,\mu_i(x)}.    (4)

Eq. (4) reflects the risk of labeling x between two classes, i.e., a class i ∈ C and the most preferred class from the others C \ i. By applying the standard t-norm and standard t-conorm, Eq. (4) amounts to a comparison between the largest membership and the second largest membership. In this paper, we extend the 2-order ambiguity measure to a generalized k-order ambiguity measure as:

A^{(k)}_C(x) = \frac{\triangledown^{(k)}_{i=1,\dots,C}\,\mu_i(x)}{\triangledown_{i=1,\dots,C}\,\mu_i(x)}.    (5)

Similarly, Eq. (5) reflects the risk of labeling x among k classes, i.e., the classes A ∈ P(C) with |A| = k − 1 and the most preferred class from the others C \ A. By applying the standard t-norm and standard t-conorm, Eq. (5) amounts to a comparison between the largest membership and the k-th largest membership.

In order to get the precise uncertainty information of x, we have to consider all the ambiguities, i.e., A^{(k)}_C(x), k = 2, ..., C. The most efficient way is to aggregate them by a weighted sum. As a result, we propose an overall ambiguity measure as Definition 1.

Definition 1 (Ambiguity): Given a sample x with a set of possibilistic memberships {μ_1(x), ..., μ_C(x)}, where μ_i(x) ∈ [0, 1] (i = 1, ..., C) is the possibility of x in the i-th class, the ambiguity of x is defined as:

A_C(x) = \sum_{k=2}^{C} w_k A^{(k)}_C(x),    (6)

where w_k is the weight for the k-order ambiguity.

It is a consensus that, in classifying a sample, the large memberships are critical and the small memberships are less important. With the standard t-norm and standard t-conorm, the k-order ambiguity is proportional to the k-th largest membership. As a result, the 2-order ambiguity should be given the highest weight, and the C-order ambiguity should be given the lowest weight. In this case, we propose to use the nonlinear weight function w_k = (log k − log(k − 1))^γ, since 1) it is positive and decreasing in k, and 2) it gives higher importance to the large memberships. In this weight function, the scale factor γ is a positive integer. Fig. 1 demonstrates the values of (log k − log(k − 1))^γ / (log 2 − log 1)^γ for γ = 1, ..., 5, which is equivalent to normalising the weights so that w_2 = 1. Obviously, a larger γ further magnifies the importance of the large memberships, i.e., the 2-order ambiguity becomes even more important and the C-order ambiguity becomes even less important.

Fig. 1: Values of (log k − log(k − 1))^γ / (log 2 − log 1)^γ with different γ (γ = 1, ..., 5).
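A minimal numerical sketch of Definition 1 under the standard t-norm and t-conorm (so that A^{(k)}_C(x) = μ'_k / μ'_1) may help fix ideas. It is illustrative Python, not the authors' implementation, and it assumes max_i μ_i(x) ≠ 0 as required by the boundary axiom:

```python
import numpy as np

def ambiguity(mu, gamma=1):
    """Overall ambiguity A_C(x) of Definition 1 with the standard (min, max) pair:
    A_C(x) = sum_{k=2}^{C} (mu'_k / mu'_1) * (log k - log(k-1))**gamma,
    where mu'_1 >= ... >= mu'_C are the memberships sorted in descending order."""
    mu_sorted = np.sort(np.asarray(mu, dtype=float))[::-1]   # mu'_1 >= ... >= mu'_C
    k = np.arange(2, mu_sorted.size + 1)
    w = (np.log(k) - np.log(k - 1)) ** gamma                  # weights w_k
    return float(np.sum(mu_sorted[1:] / mu_sorted[0] * w))    # assumes mu'_1 != 0

print(ambiguity([0.5, 0.5, 0.5, 0.5]))           # log 4: equal memberships, maximal risk
print(ambiguity([0.9, 0.0, 0.0, 0.0]))           # 0.0: only one class is possible
print(ambiguity([0.9, 0.7, 0.2, 0.1], gamma=3))  # larger gamma stresses the top memberships
```

The first two calls reproduce the boundary behaviour of Axiom 3, and the last one shows how the scale factor γ shifts the weight toward the 2-order ambiguity.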
B. Properties of the Ambiguity Measure

Fig. 2 demonstrates the ambiguity value when C = 2 and γ = 1. Under these conditions, the ambiguity has several notable features: it is symmetric about μ_1(x) = μ_2(x), strictly decreasing in max{μ_1(x), μ_2(x)}, strictly increasing in min{μ_1(x), μ_2(x)}, and concave. Besides, it attains its maximum at μ_1(x) = μ_2(x) and its minimum at min{μ_1(x), μ_2(x)} = 0. Having these observations, we further give some general properties of the ambiguity measure by relaxing the conditions C = 2 and γ = 1. For the sake of clarity, we denote μ_i(x) as μ_i in short, and let μ'_1 ≥ ... ≥ μ'_C be the sequence of μ_1, ..., μ_C in descending order.

Fig. 2: Ambiguity value when C = 2 and γ = 1.

Theorem 1: A_C(x) is a symmetric function of μ_1, ..., μ_C.

Proof: Let μ'_1 ≥ ... ≥ μ'_C be the sequence of μ_1, ..., μ_C in descending order. Then

A_C(x) = \sum_{k=2}^{C} A^{(k)}_C(x)\,(\log k - \log(k-1))^{\gamma}
       = \sum_{k=2}^{C} \Big(\triangledown^{(k)}_{i=1,\dots,C}\mu_i \,\Big/\, \triangledown_{i=1,\dots,C}\mu_i\Big)(\log k - \log(k-1))^{\gamma}
       = \sum_{k=2}^{C} \frac{\mu'_k}{\mu'_1}(\log k - \log(k-1))^{\gamma}
       = \frac{\mu'_2}{\mu'_1}(\log 2 - \log 1)^{\gamma} + \frac{\mu'_3}{\mu'_1}(\log 3 - \log 2)^{\gamma} + \dots + \frac{\mu'_C}{\mu'_1}(\log C - \log(C-1))^{\gamma}.    (7)

Given any permutation of μ_1, ..., μ_C, the ordered sequence μ'_1 ≥ ... ≥ μ'_C keeps unchanged. Based on Eq. (7) and the definition of a symmetric function, the proof is straightforward.

Theorem 2: A_C(x) decreases when μ'_1 increases and all the others are unchanged; A_C(x) increases when μ'_i, i ∈ {2, ..., C}, increases and all the others are unchanged.

Proof: Following the expression in Eq. (7), for i = 1,

\frac{\partial A_C(x)}{\partial \mu'_1} = -\frac{\mu'_2}{\mu'^2_1}(\log 2 - \log 1)^{\gamma} - \frac{\mu'_3}{\mu'^2_1}(\log 3 - \log 2)^{\gamma} - \dots - \frac{\mu'_C}{\mu'^2_1}(\log C - \log(C-1))^{\gamma} < 0;

and for i = 2, ..., C,

\frac{\partial A_C(x)}{\partial \mu'_i} = \frac{1}{\mu'_1}(\log i - \log(i-1))^{\gamma} > 0.

Theorem 3: A_C(x) attains its maximum at μ'_1 = ... = μ'_C ≠ 0, and its minimum at μ'_1 ≠ 0, μ'_{C\1} = 0 (i.e., all memberships except the largest are zero).

Proof: When μ'_1 = ... = μ'_C ≠ 0, it is clear that μ'_2/μ'_1 = ... = μ'_C/μ'_1 = 1; when μ'_1 ≠ 0 and μ'_{C\1} = 0, it is clear that μ'_2/μ'_1 = ... = μ'_C/μ'_1 = 0. Since μ'_1 ≥ ... ≥ μ'_C, following the expression in Eq. (7), the proof is straightforward.

Theorem 4: When γ is set to 1, A_C(x) has the same form as Eq. (1).

Proof: Following the expression in Eq. (7) and supposing μ'_{C+1} = 0, when γ = 1 we have:

A_C(x) = \frac{\mu'_2}{\mu'_1}(\log 2 - \log 1) + \frac{\mu'_3}{\mu'_1}(\log 3 - \log 2) + \dots + \frac{\mu'_C}{\mu'_1}(\log C - \log(C-1))
       = \Big(\frac{\mu'_1}{\mu'_1} - \frac{\mu'_2}{\mu'_1}\Big)\log 1 + \Big(\frac{\mu'_2}{\mu'_1} - \frac{\mu'_3}{\mu'_1}\Big)\log 2 + \dots + \Big(\frac{\mu'_C}{\mu'_1} - \frac{\mu'_{C+1}}{\mu'_1}\Big)\log C
       = \sum_{i=1}^{C} (\mu^*_i - \mu^*_{i+1})\log i,

where μ*_i = μ'_i / μ'_1.

Theorems 1∼3 show that the three basic axioms given in Section III-A are satisfied by the proposed ambiguity measure, and Theorem 4 demonstrates that the proposed measure is a generalized and extended version of Eq. (1) proposed in [19].

C. Fuzzy Memberships of Unlabeled Samples

Given a binary training set {(x_i, y_i)}_{i=1}^{n} ∈ R^m × {+1, −1}, the SVM hyperplane is defined as w^T x + b = 0, where w ∈ R^m and b ∈ R. The linearly separable case is formulated as

\min_{w,b}\ \frac{1}{2} w^T w, \quad \text{s.t.}\ y_i(w^T x_i + b) \geq 1,\ i = 1, \dots, n.

The nonlinearly separable case can be handled by the soft-margin SVM, which transforms the formulation into

\min_{w,b,\xi}\ \frac{1}{2} w^T w + \theta \sum_{i=1}^{n} \xi_i, \quad \text{s.t.}\ y_i(w^T x_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0,\ i = 1, \dots, n,

where ξ_i is the slack variable introduced for x_i, and θ is a trade-off between the maximum margin and the minimum training error. Besides, the kernel trick [2] is adopted, which maps the samples into a higher dimensional feature space via φ : x → φ(x), and expresses the inner product in the feature space as a kernel function K: ⟨φ(x), φ(x_i)⟩ = K(x, x_i). By the Lagrange method, the decision function is obtained as

h(x) = \sum_{i=1}^{n} y_i \alpha_i K(x, x_i) + b,

where α_i is the Lagrange multiplier of x_i, and the final classifier is f(x) = sign(h(x)).

SVM-based multiclass models decompose a multiclass problem into a set of binary problems. Among the solutions, the OAA approach is one of the most effective and efficient. It constructs C binary classifiers for a C-class problem, where f_q(x) = sign(h_q(x)), q = 1, ..., C, separates class q from the remaining C − 1 classes. For a testing sample x̂ ∈ R^m, the output class is determined as ŷ = argmax_{q=1,...,C} h_q(x̂). In a binary SVM, the absolute decision value of a sample is proportional to its distance to the SVM hyperplane, and a larger distance represents a higher degree of certainty in its label.
In the OAA-based SVM model, the membership of x in class q can be computed by the logistic function [21], i.e., μ_q(x) = 1 / (1 + e^{−h_q(x)}), which has the following properties:
• When h_q(x) > 0, μ_q(x) ∈ (0.5, 1], and μ_q(x) increases with the increase of |h_q(x)|;
• When h_q(x) < 0, μ_q(x) ∈ [0, 0.5), and μ_q(x) decreases with the increase of |h_q(x)|;
• When h_q(x) = 0, x lies on the decision boundary, and μ_q(x) = 0.5.
Obviously, each membership is defined independently for a specific class. Since the memberships are possibilistic, the problem is suitable to be handled by possibility theory.

D. Algorithm Description

By applying the ambiguity measure and the fuzzy membership calculation, the ambiguity-based MAL strategy is depicted as Algorithm 1.

Algorithm 1: Ambiguity-Based MAL
Input:
• Initial labeled set L = {(x_i, y_i)}_{i=1}^{l} ∈ R^m × {1, ..., C};
• Unlabeled pool U = {x_j}_{j=1}^{u} ∈ R^m;
• Scale factor γ and parameters for training the base classifiers.
Output:
• Classifiers f_1, ..., f_C trained on the final labeled set.
1  Learn C SVM hyperplanes h_1, ..., h_C on L by the OAA approach, one for each class;
2  while U is not empty do
3    if the stop criterion is met then
4      Let f_1 = sign(h_1), ..., f_C = sign(h_C);
5      return f_1, ..., f_C;
6    else
7      for each x_j ∈ U do
8        Calculate its decision values by the SVMs, i.e., h_q(x_j), q = 1, ..., C;
9        Calculate its fuzzy memberships in the classes, i.e., μ_q(x_j) = 1 / (1 + e^{−h_q(x_j)}), q = 1, ..., C;
10       Calculate its ambiguity A_C(x_j) based on Eq. (6);
11     end
12     Find the unlabeled sample with the maximum ambiguity, i.e., x* = argmax_{x_j} A_C(x_j);
13     Query the label of x*, denoted by y*;
14     Let U = U \ x*, and L = L ∪ (x*, y*);
15     Update h_1, ..., h_C based on L;
16   end
17 end
18 return f_1, ..., f_C;

We now give an analysis of the time complexity of selecting one sample in Algorithm 1. In a given iteration, suppose the number of labeled training samples is l and the number of unlabeled samples is u. Based on [22], training a radial basis function (RBF) kernel-based SVM has a worst-case complexity of O(s^3), and making a prediction for one testing sample has a complexity of O(sm), where s is the number of support vectors (SVs) and m is the input dimension. Thus, training the C binary SVMs (line 1) has a worst-case complexity of O(Cl^3), calculating the decision values of the u unlabeled samples (line 8) costs at most O(uClm), and calculating the ambiguity values of the u unlabeled samples (lines 9∼10) costs at most O(uC^2). Furthermore, finding the sample with the maximum ambiguity (line 12) costs O(u). Finally, the complexity of selecting one sample in Algorithm 1 is O(Cl^3) + O(uClm) + O(uC^2) + O(u) ≈ O(Cl^3) + O(uClm) = O(Cl(l^2 + um)). It is noteworthy that this is the highest possible complexity, reached when all the training samples are support vectors.
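For illustration, one query round of Algorithm 1 may be sketched as follows. The paper's experiments run libsvm under MATLAB; the sketch below instead uses scikit-learn's SVC purely as a stand-in OAA implementation (our assumption, not the authors' code), with the kernel and θ settings of Section IV-B and the Eq. (6) computation written inline.

```python
# One query round of Algorithm 1 (illustrative sketch only).
import numpy as np
from sklearn.svm import SVC

def select_most_ambiguous(X_lab, y_lab, X_pool, gamma_scale=1, theta=100, sigma=1.0):
    classes = np.unique(y_lab)
    # Lines 1/15: one binary RBF-SVM per class (one-against-all); theta is the
    # C-parameter, and kernel exp(-||x - x_i||^2 / (2 sigma^2)) maps to
    # scikit-learn's gamma = 1 / (2 sigma^2).
    svms = [SVC(kernel="rbf", C=theta, gamma=1.0 / (2 * sigma ** 2))
            .fit(X_lab, (y_lab == c).astype(int)) for c in classes]
    # Line 8: decision values h_q(x_j); line 9: logistic memberships mu_q(x_j).
    H = np.column_stack([svm.decision_function(X_pool) for svm in svms])
    M = 1.0 / (1.0 + np.exp(-H))
    # Line 10: ambiguity A_C(x_j) of Eq. (6) with the standard (min, max) pair.
    Ms = np.sort(M, axis=1)[:, ::-1]                 # memberships in descending order
    k = np.arange(2, Ms.shape[1] + 1)
    w = (np.log(k) - np.log(k - 1)) ** gamma_scale   # weights w_k
    amb = (Ms[:, 1:] / Ms[:, [0]] * w).sum(axis=1)
    # Line 12: pick the most ambiguous pooled sample.
    return int(np.argmax(amb))

# The selected sample is then labeled, moved from the pool to the labeled set,
# and the C hyperplanes are retrained (lines 13-15 of Algorithm 1).
```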
IV. EXPERIMENTAL COMPARISONS

A. Comparative Methods

Five MAL strategies are used in this paper to compare with the proposed algorithm (Ambiguity).

1) Random Sampling (Random): During each iteration, the learner randomly selects an unlabeled sample for query. The OAA approach is adopted to train the base classifiers.

2) Margin-Based Strategy (Margin) [5]: With the OAA approach, the margin values of the C binary SVMs are aggregated into an overall margin, i.e., m(x) = \prod_{q=1}^{C} |h_q(x)|, and the learner selects the sample with the minimum aggregated margin, i.e., x* = argmin_x m(x).

3) Version Space Reduction (VS Reduction) [15], [16]: With the OAA approach, assume the original VS area of h_q is Area(V^{(q)}), and the new area after querying sample x is Area(V_x^{(q)}). An approximation method is applied, i.e., Area(V_x^{(q)}) ≈ \frac{|h_q(x)| + 1}{2}\,Area(V^{(q)}). Finally, the sample is selected by x* = argmin_x \prod_{q=1}^{C} Area(V_x^{(q)}).

4) Entropy-Based Strategy (Entropy) [17]: This method is based on probability theory. The OAO approach is adopted, which constructs C(C − 1)/2 binary classifiers, each for a pair of classes. The classifier of class q against class g is defined as h_{q,g}(x) for q < g, where x belongs to class q if h_{q,g}(x) > 0 and to class g if h_{q,g}(x) < 0; besides, h_{q,g}(x) = −h_{g,q}(x) when q > g. The pairwise probabilities of x regarding classes q and g are derived as r_{q,g}(x) = 1 / (1 + e^{−h_{q,g}(x)}) when q < g, and r_{q,g}(x) = 1 − r_{g,q}(x) when q > g. The probability of x in class q is calculated as

p_q(x) = \frac{2\sum_{g=1, g \neq q}^{C} r_{q,g}(x)}{C(C-1)}.

Obviously, \sum_{q=1}^{C} p_q(x) = 1. Finally, the sample with the maximum entropy is selected, i.e., x* = argmax_x −\sum_{q=1}^{C} p_q(x) \log p_q(x).

5) Best vs. Second Best (BvSB) [17]: This method applies the same probability estimation process as Entropy, and only makes use of the two most important classes. Assume the largest and second largest class probabilities of sample x are p*_1(x) and p*_2(x), respectively; then the most informative sample is selected by x* = argmin_x (p*_1(x) − p*_2(x)).
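For comparison with the possibility-based criterion, the probability-based scores used by the Entropy and BvSB baselines can be sketched as follows (illustrative Python; the function names are ours, and the pairwise decision values h_{q,g} are assumed to be given by the OAO model):

```python
import numpy as np

def oao_class_probabilities(H_pair):
    """Class probabilities from the pairwise (OAO) decision values used by the
    Entropy and BvSB baselines [17]. Only the upper triangle h_{q,g}, q < g, of
    H_pair is used; r_{q,g} = 1/(1+exp(-h_{q,g})) and r_{g,q} = 1 - r_{q,g}."""
    C = H_pair.shape[0]
    R = np.zeros((C, C))
    for q in range(C):
        for g in range(q + 1, C):
            R[q, g] = 1.0 / (1.0 + np.exp(-H_pair[q, g]))
            R[g, q] = 1.0 - R[q, g]
    p = 2.0 * R.sum(axis=1) / (C * (C - 1))   # p_q(x); sums to 1 by construction
    return p

def entropy_score(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))   # Entropy queries the argmax of this

def bvsb_score(p):
    p1, p2 = np.sort(p)[::-1][:2]
    return float(p1 - p2)                        # BvSB queries the argmin of this
```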
B. Experimental Design

The experiments are first conducted on 12 multiclass UCI machine learning datasets, as listed in Table I. Since testing samples are not available for Glass, Cotton, Libras, Dermatology, Ecoli, Yeast, and Letter, 90% of the data are randomly selected as the training set and the remaining 10% as the testing set, so that a sufficiently large unlabeled pool is obtained. Each input feature is normalised to [0, 1]. The initial training set is formed by two randomly chosen labeled samples from each class, and the learning stops after 100 new samples have been labeled or the selective pool becomes empty. To avoid random effects, 50 trials are conducted on the datasets with fewer than 2,000 samples, and 10 trials are conducted on the larger datasets. Finally, the average results are recorded. For fair comparison, θ is fixed as 100 for SVM, and the RBF kernel K(x, x_i) = exp(−‖x − x_i‖^2 / (2σ^2)) with σ = 1 is adopted. Besides, γ in Eq. (6) is treated as a parameter and tuned on the training set. More specifically, the training set X is divided into two subsets X_1 and X_2 of equal size; active learning is conducted on X_1 with γ fixed to 1, 2, ..., 10; the resulting models are then validated on X_2, and the best γ value is selected. The selected γ values for the 12 datasets are listed in the last column of Table I. The experiments are performed under MATLAB 7.9.0 with the "svmtrain" and "svmpredict" functions of libsvm, executed on a computer with a 3.16-GHz Intel Core 2 Duo CPU, a 4-GB memory, and a 64-bit Windows 7 system.

TABLE I: Selected datasets for performance comparison
Dataset (#Train / #Test / #Feature / Feature Type / #Class):
Glass (214 / 0 / 10 / Real+Integer / 6); Cotton (356 / 0 / 21 / Real / 6); Libras (360 / 0 / 90 / Real / 15); Dermatology (366 / 0 / 34 / Real+Integer / 6); Ecoli (366 / 0 / 7 / Real+Integer / 8); Yeast (1,484 / 0 / 8 / Real+Integer / 10); Letter (20,000 / 0 / 16 / Real / 26); Soybean (307 / 376 / 35 / Real+Integer / 19); Vowel (528 / 462 / 10 / Real / 11); Optdigits (3,823 / 1,797 / 64 / Real+Integer / 10); Satellite (4,435 / 2,000 / 36 / Real / 6); Pen (7,494 / 3,498 / 16 / Real+Integer / 10).
Note: The class distribution of dataset "Letter" is 789 / 766 / 736 / 805 / 768 / 758 / 748 / 796 / 813 / 764 / 752 / 787 / 786 / 734.

C. Empirical Studies

Among the six strategies, Random is a baseline, Margin and VS Reduction directly utilize the output decision values of the SVMs, Entropy and BvSB are probability approaches based on the OAO model, and Ambiguity is a possibility approach based on the OAA model. Fig. 3 demonstrates the average testing accuracy and standard deviation over the different trials for the six strategies, and the average results on the 12 datasets are shown in Fig. 4.

Fig. 4: Average result on the 12 UCI datasets. (a) Testing accuracy. (b) Standard deviation.

It is clear from these results that Ambiguity has obtained satisfactory performances on all the datasets except Vowel. In fact, Vowel is a difficult problem on which all the methods fail to achieve an accuracy higher than 50% after the learning stops. Besides, Ambiguity has achieved very similar performance to BvSB in some cases (e.g., datasets Dermatology, Yeast, and Letter). This could be due to the fact that when the scale factor γ is large enough, the ambiguity measure can be regarded as only considering the 2-order ambiguity. Since the 2-order ambiguity is decided only by the largest and the second largest memberships, Ambiguity is intrinsically the same as BvSB in this case. Furthermore, Ambiguity has achieved low standard deviations on most datasets. However, all the methods show fluctuating standard deviations on datasets Letter and Pen, which could be caused by the large size of these datasets and the small number of trials on them.

Another typical phenomenon observed from Fig. 3 is that Entropy, which is an effective uncertainty measurement for many problems, performs worse than Ambiguity and even worse than Random in many cases. In order to find the reason, we investigate the learning process on dataset Ecoli. Fig. 5 demonstrates the class possibilities and probabilities of three unlabeled samples in one iteration.

Fig. 5: Class memberships of 3 samples (possibility and probability). (a) Sample 1. (b) Sample 2. (c) Sample 3.

It is calculated from the possibilities that the ambiguity values of the three samples are 1.722, 0.966, and 1.040. Obviously, Sample 1 is more beneficial to the learning, and it will be selected by Ambiguity. It is further calculated from the probabilities that the entropy values of the three samples are 2.953, 2.981,
and 2.951. In this case, Sample 2 will be selected by Entropy. However, this might not be a good selection, since the advantage of Sample 2 over Samples 1 and 3 is too trivial. This example shows that the rigorous computation of class probabilities may weaken the differences among samples in uncertainty; especially when the number of classes is large, the entropy value is highly affected by the probabilities of unimportant classes. In the context of active learning, possibility approach might be more effective in distinguishing unlabeled samples.

Table II reports the mean accuracy and standard deviation over the 100 learning iterations, as well as the final accuracy and the average time for selecting one sample. It is observed that Ambiguity has achieved the highest mean accuracy and final accuracy on 11 and 9 datasets out of 12, respectively. Besides, Ambiguity is much faster than Entropy and BvSB, but slightly slower than Margin and VS Reduction. It is noteworthy that in a real active learning process, labeling a sample usually takes much more time than selecting a sample. For instance, it may take several seconds to several minutes to label a sample, while the selection itself takes only milliseconds.
Fig. 3: Experimental comparisons on the selected UCI datasets. (a)∼(l) Testing accuracy. (m)∼(x) Standard deviation.
TABLE II: Performance comparison on the selected datasets: mean accuracy (%) with standard deviation, final accuracy (%), and average time for selecting one sample (seconds). Within each row, values are listed in the dataset order Glass, Cotton, Libras, Dermatology, Ecoli, Yeast, Letter, Soybean, Vowel, Optdigits, Satellite, Pen, followed by the average (Avg.).
Random:
  mean:  85.16±6.05, 81.60±5.02, 64.93±6.89, 95.87±1.01, 77.67±2.69, 48.66±5.36, 45.32±5.10, 84.93±4.54, 38.47±4.97, 84.16±6.35, 78.92±2.82, 84.77±5.12; Avg. 72.54±4.66
  final: 91.71, 87.22, 74.94, 97.14, 80.76, 54.51, 53.00, 89.65, 44.85, 90.82, 81.00, 90.03; Avg. 77.97
  time:  0.0074*, 0.0139*, 0.0381*, 0.0146*, 0.0084*, 0.0206*, 0.0465*, 0.0262*, 0.0179*, 0.0328*, 0.0110*, 0.0175*; Avg. 0.0212*
Margin:
  mean:  88.99±7.22, 85.03±6.64, 66.38±8.96, 96.51±1.35, 79.47±2.69, 48.02±4.41, 45.47±4.49, 85.87±5.54, 37.79±4.46, 74.30±3.64, 79.33±2.53, 88.36±6.49; Avg. 72.96±4.87
  final: 94.57, 91.56, 80.06, 97.68, 82.06, 52.51, 52.31, 92.65, 44.11, 81.03, 82.21, 95.01; Avg. 78.81
  time:  0.0105, 0.0191, 0.0724, 0.0208, 0.0132, 0.0362, 0.7616, 0.0606, 0.0238, 0.5581, 0.1476, 0.1162; Avg. 0.1533
VS Reduction:
  mean:  89.41±7.56, 84.67±7.38, 66.41±9.28, 96.97±1.36, 79.86±2.66, 47.51±4.20, 46.05±4.97, 86.35±5.81, 39.24±4.74, 75.36±2.79, 80.14±3.74, 85.07±6.14; Avg. 73.09±5.05
  final: 94.86, 92.11, 80.22, 97.84, 81.88, 51.45, 53.31, 92.58, 44.64, 80.96, 83.34, 92.60; Avg. 78.82
  time:  0.0112, 0.0200, 0.0706, 0.0224, 0.0137, 0.0384, 0.7852, 0.0633, 0.0247, 0.5012, 0.1527, 0.1170; Avg. 0.1517
Entropy:
  mean:  87.87±4.84, 81.17±4.79, 66.85±7.95, 96.26±1.10, 76.94±2.66, 43.64±4.12, 39.32±1.83, 82.13±2.32, 38.63±2.92, 82.64±10.63, 76.98±2.28, 84.99±4.93; Avg. 71.45±4.20
  final: 93.62, 87.11, 75.11, 96.54, 79.29, 49.74, 41.80, 86.19, 42.34, 92.37, 79.79, 89.56; Avg. 76.12
  time:  0.0367, 0.0869, 0.2391, 0.0946, 0.0658, 0.3970, 10.455, 0.2305, 0.1557, 1.4646, 0.9483, 1.8034; Avg. 1.3314
BvSB:
  mean:  88.53±4.91, 83.89±5.68, 67.96±8.31, 96.75±1.64, 79.55±2.78, 49.66±5.11, 46.53±5.95, 88.89±5.13, 43.94±5.36, 82.23±11.95, 80.99±2.69, 89.52±6.21; Avg. 74.87±5.48
  final: 93.05, 89.78, 79.28, 97.78, 81.18, 54.77, 54.51, 92.89, 50.26, 93.96, 83.41, 95.35; Avg. 80.52
  time:  0.0369, 0.0873, 0.2341, 0.0933, 0.0656, 0.3984, 10.625, 0.2261, 0.1547, 1.4666, 0.9027, 1.7915; Avg. 1.3402
Ambiguity:
  mean:  90.62±6.78, 86.44±6.80, 70.77±8.62, 97.17±1.14, 80.43±2.58, 50.16±4.84, 46.53±5.92, 90.04±4.42, 40.03±5.46, 89.53±6.15, 81.17±3.18, 90.18±6.94; Avg. 76.09±5.23
  final: 94.57, 92.72, 82.78, 97.73, 82.29, 55.00, 55.53, 93.29, 46.63, 95.29, 84.09, 96.48; Avg. 81.37
  time:  0.0112, 0.0200, 0.0803, 0.0243, 0.0141, 0.0432, 0.9271, 0.0634, 0.0275, 0.6073, 0.1778, 0.1440; Avg. 0.1783
Note: For each dataset, the highest mean accuracy and final accuracy are highlighted in the original table, and the minimum time for selecting one sample is marked with *.

TABLE III: Paired Wilcoxon's signed rank tests (p-values). In each comparison, the two values are respectively the p-values of the tests on the mean accuracy and on the final accuracy; † indicates that the two referred methods are significantly different at the significance level 0.05.
Random vs.: Margin 0.1514 / 0.2334; VS Reduction 0.0923 / 0.1763; Entropy 0.2036 / 0.0923; BvSB 0.0049† / 0.0005†; Ambiguity 0.0005† / 0.0005†.
Margin vs.: VS Reduction 0.1763 / 0.5186; Entropy 0.0923 / 0.0342†; BvSB 0.0122† / 0.2661; Ambiguity 0.0005† / 0.0010†.
VS Reduction vs.: Entropy 0.0522 / 0.0269†; BvSB 0.0425† / 0.3013; Ambiguity 0.0005† / 0.0024†.
Entropy vs.: BvSB 0.0010† / 0.0010†; Ambiguity 0.0005† / 0.0005†.
BvSB vs.: Ambiguity 0.0269† / 0.0425†.

Assuredly, the time complexity is acceptable. Finally, Table III reports the p-values of paired Wilcoxon's signed rank tests conducted on the accuracies listed in Table II. We adopt the significance level 0.05, i.e., if the p-value is smaller than 0.05, the two referred methods are considered statistically different. It can be seen that Ambiguity is statistically different from all the other methods in terms of both the mean accuracy and the final accuracy.
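As an aside on how Table III can be read, each cell is the p-value of a paired Wilcoxon signed-rank test over the per-dataset accuracies of two strategies. The sketch below runs such a test with SciPy on the mean accuracies of Ambiguity and BvSB taken from Table II; it is an illustration only, and the exact p-value may differ slightly from the reported 0.0269 depending on how zero differences are handled by the implementation.

```python
# Paired Wilcoxon signed-rank test over the 12 per-dataset mean accuracies.
from scipy.stats import wilcoxon

ambiguity_mean = [90.62, 86.44, 70.77, 97.17, 80.43, 50.16, 46.53, 90.04, 40.03, 89.53, 81.17, 90.18]
bvsb_mean      = [88.53, 83.89, 67.96, 96.75, 79.55, 49.66, 46.53, 88.89, 43.94, 82.23, 80.99, 89.52]

stat, p_value = wilcoxon(ambiguity_mean, bvsb_mean)
print(p_value, p_value < 0.05)   # the difference is significant at 0.05 if the flag is True
```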
D. Handwritten Digits Image Recognition Problem

We further conduct experiments on the MNIST handwritten digits image recognition problem (http://yann.lecun.com/exdb/mnist), which aims to distinguish the handwritten digits 0∼9, as shown in Fig. 6(a). This dataset contains 60,000 training samples and 10,000 testing samples from approximately 250 writers, with a relatively balanced class distribution. We use a gradient-based method [23] to extract 2,172 features for each sample, and select 68 features by WEKA. Different from the previous experiments, we apply batch-mode active learning on this dataset, which selects multiple samples with high diversity during each iteration. We combine the ambiguity measure with two diversity criteria proposed in [24], which are angle-based diversity (ABD) and enhanced clustering-based diversity (ECBD), and realize two batch-mode active learning strategies, i.e., Ambiguity-ABD and Ambiguity-ECBD. Besides, we compare them with batch-mode random sampling (Random), the ambiguity strategy without diversity criteria (Ambiguity), and two strategies in [24] that combine multiclass-level uncertainty (MCLU) with ABD and ECBD, i.e., MCLU-ABD and MCLU-ECBD.

The initial training set contains two randomly chosen samples from each class. During each iteration, the learner considers 40 informative samples and selects the five most diverse ones from them. Besides, the learning stops after 60 iterations, γ is tuned as 9 for all the ambiguity-based strategies, and 10 trials are conducted. The mean accuracy and standard deviation are shown in Figs. 6(b)∼(c). It can be observed that the initial accuracy is just slightly higher than 50%. After 300 new samples (0.5% of the whole set) have been queried, the accuracy has been improved by about 30%. Besides, Ambiguity-ECBD has achieved the best performance, which demonstrates the potential of combining the ambiguity measure with the ECBD criterion.

Fig. 6: Batch-mode active learning results on the MNIST dataset. (a) Samples in the MNIST dataset. (b) Testing accuracy. (c) Standard deviation.

V. CONCLUSIONS AND FUTURE WORKS

This paper proposed an ambiguity-based MAL strategy by applying possibility approach, and realized it for the OAA-based SVM model. This strategy relaxes the additive property required in probability approach; thus, it computes the memberships in a more flexible way and evaluates unlabeled samples less rigorously. Experimental results demonstrate that the proposed strategy can achieve satisfactory performances on various multiclass problems. Future developments regarding this work are listed as follows.
1) In the experiments, we treat the scale factor γ as a model parameter and tune it empirically. In the future, it might be useful to discuss how to obtain the optimal γ based on the characteristics of the dataset.
2) It might be interesting to apply the proposed ambiguity measure to base classifiers other than SVMs.
3) If we transform a possibility vector into a probability vector, or conversely, the existing possibilistic and probabilistic models can be realized in a more flexible way. How to realize an effective and efficient transformation between possibility and probability for MAL will also be one of the future research directions.

REFERENCES
[1] D. Cohn, L. Atlas, and R. Ladner, "Improving generalization with active learning," Mach. Learn., vol. 15, no. 2, pp. 201–221, 1994.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer Verlag, 2000.
[3] R. Wang, D. Chen, and S. Kwong, "Fuzzy rough set based active learning," IEEE Trans. Fuzzy Syst., vol. 22, no. 6, pp.
1699–1704, 2014. [4] R. Wang, S. Kwong, and D. Chen, “Inconsistency-based active learning for support vector machines,” Pattern Recogn., vol. 45, no. 10, pp. 3751– 3767, 2012. [5] S. Tong, “Active learning: theory and applications,” Ph.D. dissertation, Citeseer, 2001. [6] M. Li and I. K. Sethi, “Confidence-based active learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1251–1261, 2006. [7] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141, 2004. [8] T.-F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” J. Mach. Learn. Res., vol. 5, no. 975-1005, p. 4, 2004. [9] D. Dubois and H. Prade, Possibility theory. Springer, 1988. [10] G. J. Klir and B. Yuan, Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall New Jersey, 1995. [11] C. Frélicot and L. Mascarilla, “A third way to design pattern classifiers with reject options,” in Proc. 21st Int. Conf. of the North American Fuzzy Information Processing Society. IEEE, 2002, pp. 395–399. [12] C. Frélicot, L. Mascarilla, and A. Fruchard, “An ambiguity measure for pattern recognition problems using triangular-norms combination,” WSEAS Trans. Syst., vol. 8, no. 3, pp. 2710–2715, 2004. [13] L. Mascarilla, M. Berthier, and C. Frélicot, “A k-order fuzzy OR operator for pattern classification with k-order ambiguity rejection,” Fuzzy Sets Syst., vol. 159, no. 15, pp. 2011–2029, 2008. [14] T. M. Hospedales, S. Gong, and T. Xiang, “Finding rare classes: Active learning with generative and discriminative models,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 2, pp. 374–386, 2013. [15] R. Yan and A. Hauptmann, “Multi-class active learning for video semantic feature extraction,” in Proc. 2004 IEEE Int. Conf. Multimedia and Expo, vol. 1. IEEE, 2004, pp. 69–72. [16] R. Yan, J. Yang, and A. Hauptmann, “Automatically labeling video data using multi-class active learning,” in Proc. 9th ICCV. IEEE, 2003, pp. 516–523. [17] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, “Scalable active learning for multiclass image classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2259–2273, 2012. [18] D. Dubois, H. Prade, and S. Sandri, “On possibility/probability transformations,” in Fuzzy logic, 1993, pp. 103–112. [19] X. Z. Wang, L. C. Dong, and J. H. Yan, “Maximum ambiguity-based sample selection in fuzzy decision tree induction,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1491–1505, 2012. [20] R. Wang, Y.-L. He, C.-Y. Chow, F.-F. Ou, and J. Zhang, “Learning ELMtree from big data based on uncertainty reduction,” Fuzzy Sets Syst., vol. 258, pp. 79–100, 2015. [21] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Batch mode active learning and its application to medical image classification,” in Proc. 23rd ICML. ACM, 2006, pp. 417–424. [22] L. Bottou and C.-J. Lin, “Support vector machine solvers,” Large Scale Kernel Machines, pp. 301–320, 2007. [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278– 2324, 1998. [24] B. Demir, C. Persello, and L. Bruzzone, “Batch-mode active-learning methods for the interactive classification of remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 3, pp. 1014–1031, 2011. Ran Wang (S’09-M’14) received her B.Eng. degree in computer science from the College of Information Science and Technology, Beijing Forestry University, China, in 2009, and the Ph.D. 
degree from City University of Hong Kong, in 2014. She is currently a Postdoctoral Senior Research Associate at the Department of Computer Science, City University of Hong Kong. Since 2014, she is also an Assistant Researcher at the Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. Her current research interests include pattern recognition, machine learning, fuzzy sets and fuzzy logic, and their related applications. Chi-Yin Chow received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities in 2008 and 2010, respectively. He is currently an assistant professor in Department of Computer Science, City University of Hong Kong. His research interests include spatio-temporal data management and analysis, GIS, mobile computing, and location-based services. He is the co-founder and co-organizer of ACM SIGSPATIAL MobiGIS 2012, 2013, and 2014. Sam Kwong (M’93-SM’04-F’13) received the B.Sc. and M.S. degrees in electrical engineering from the State University of New York at Buffalo in 1983, the University of Waterloo, ON, Canada, in 1985, and the Ph.D. degree from the University of Hagen, Germany, in 1996. From 1985 to 1987, he was a Diagnostic Engineer with Control Data Canada. He joined Bell Northern Research Canada as a Member of Scientific Staff. In 1990, he became a Lecturer in the Department of Electronic Engineering, City University of Hong Kong, where he is currently a Professor and Head in the Department of Computer Science. His main research interests include evolutionary computation, video coding, pattern recognition, and machine learning. Dr. Kwong is an Associate Editor of the IEEE Transactions on Industrial Electronics, the IEEE Transactions on Industrial Informatics, and the Information Sciences Journal.