Self-paced Curriculum Learning

Lu Jiang¹, Deyu Meng¹,², Qian Zhao¹,², Shiguang Shan¹,³, Alexander G. Hauptmann¹
¹ School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, 15213
² School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi, P. R. China, 710049
³ Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China, 100190
lujiang@cs.cmu.edu, dymeng@mail.xjtu.edu.cn, timmy.zhaoqian@gmail.com, sgshan@ict.ac.cn, alex@cs.cmu.edu

Abstract

Curriculum learning (CL) or self-paced learning (SPL) represents a recently proposed learning regime inspired by the learning process of humans and animals that gradually proceeds from easy to more complex samples in training. The two methods share a similar conceptual learning paradigm, but differ in specific learning schemes. In CL, the curriculum is predetermined by prior knowledge, and remains fixed thereafter. Therefore, this type of method heavily relies on the quality of prior knowledge while ignoring feedback about the learner. In SPL, the curriculum is dynamically determined to adjust to the learning pace of the learner. However, SPL is unable to deal with prior knowledge, rendering it prone to overfitting. In this paper, we discover the missing link between CL and SPL, and propose a unified framework named self-paced curriculum learning (SPCL). SPCL is formulated as a concise optimization problem that takes into account both prior knowledge known before training and the learning progress during training. In comparison to human education, SPCL is analogous to an “instructor-student-collaborative” learning mode, as opposed to “instructor-driven” in CL or “student-driven” in SPL. Empirically, we show the advantage of SPCL on two tasks.

Curriculum learning (Bengio et al. 2009) and self-paced learning (Kumar, Packer, and Koller 2010) have been attracting increasing attention in the field of machine learning and artificial intelligence.
Both learning paradigms are inspired by the learning principle underlying the cognitive process of humans and animals, which generally start with learning easier aspects of a task, and then gradually take more complex examples into consideration. The intuition can be explained by analogy to human education, in which a pupil is supposed to understand elementary algebra before he or she can learn more advanced algebra topics. This learning paradigm has been empirically demonstrated to be instrumental in avoiding bad local minima and in achieving a better generalization result (Khan, Zhu, and Mutlu 2011; Basu and Christensen 2013; Tang et al. 2012).

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

A curriculum determines a sequence of training samples which essentially corresponds to a list of samples ranked in ascending order of learning difficulty. A major disparity between curriculum learning (CL) and self-paced learning (SPL) lies in the derivation of the curriculum. In CL, the curriculum is assumed to be given by an oracle beforehand, and remains fixed thereafter. In SPL, the curriculum is dynamically generated by the learner itself, according to what the learner has already learned.

The advantage of CL includes the flexibility to incorporate prior knowledge from various sources. Its drawback stems from the fact that the curriculum design is determined independently of the subsequent learning, which may result in inconsistency between the fixed curriculum and the dynamically learned models. From the optimization perspective, since the learning proceeds iteratively, there is no guarantee that the predetermined curriculum can even lead to a converged solution. SPL, on the other hand, formulates the learning problem as a concise biconvex problem, where the curriculum design is embedded and jointly learned with model parameters. Therefore, the learned model is consistent.
However, SPL is limited in incorporating prior knowledge into learning, rendering it prone to overfitting. Ignoring prior knowledge is less reasonable when reliable prior information is available. Since both methods have their advantages, it is difficult to judge which one is better in practice.

In this paper, we discover the missing link between CL and SPL. We formally propose a unified framework called Self-paced Curriculum Learning (SPCL). SPCL represents a general learning paradigm that combines the merits of both CL and SPL. On one hand, it inherits and further generalizes the theory of SPL. On the other hand, SPCL addresses the drawback of SPL by introducing a flexible way to incorporate prior knowledge. This paper also discusses concrete implementations within the proposed framework, which can be useful for solving various problems.

This paper offers a compelling insight into the relationship between the existing CL and SPL methods. Their relation can be intuitively explained in the context of human education, in which SPCL represents an “instructor-student collaborative” learning paradigm, as opposed to “instructor-driven” in CL or “student-driven” in SPL. In SPCL, instructors provide prior knowledge on a weak learning sequence of samples, while leaving students the freedom to decide the actual curriculum according to their learning pace. Since an optimal curriculum for the instructor may not necessarily be optimal for all students, we hypothesize that given reasonable prior knowledge, the curriculum devised by instructors and students together can be expected to be better than the curriculum designed by either party alone. Empirically, we substantiate this hypothesis by demonstrating that the proposed method outperforms both CL and SPL on two tasks.

The rest of the paper is organized as follows. We first briefly introduce the background knowledge on CL and SPL. Then we propose the model and the algorithm of SPCL.
After that, we discuss concrete implementations of SPCL. The experimental results and conclusions are presented in the last two sections.

Background Knowledge

Curriculum Learning

Bengio et al. proposed a new learning paradigm called curriculum learning (CL), in which a model is learned by gradually including samples in training, from easy to complex, so as to increase the entropy of training samples (Bengio et al. 2009). Afterwards, Bengio and his colleagues presented insightful explorations of the rationale underlying this learning paradigm, and discussed the relationship between CL and conventional optimization techniques, e.g., the continuation and annealing methods (Bengio, Courville, and Vincent 2013; Bengio 2014). From the human behavioral perspective, evidence has shown that CL is consistent with the principle in human teaching (Khan, Zhu, and Mutlu 2011; Basu and Christensen 2013).

The CL methodology has been applied to various applications, the key to which is finding a ranking function that assigns learning priorities to training samples. Given a training set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ denotes the i-th observed sample and $y_i$ represents its label, a curriculum is characterized by a ranking function γ. A sample with a higher rank, i.e., a smaller value, is supposed to be learned earlier.

The curriculum (or the ranking function) is often derived by predetermined heuristics for particular problems. For example, in the task of classifying geometrical shapes, the ranking function was derived from the variability in shape (Bengio et al. 2009). The shapes exhibiting less variability are supposed to be learned earlier. In (Khan, Zhu, and Mutlu 2011), the authors tried to teach a robot the concept of “graspability”, i.e., whether an object can be grasped and picked up with one hand, in which participants were asked to assign a learning sequence of graspability to various objects. The ranking is determined by the common sense of the participants.
In (Spitkovsky, Alshawi, and Jurafsky 2009), the authors approached grammar induction, where the ranking function is derived in terms of the length of a sentence. The heuristic is that the number of possible solutions grows exponentially with the length of the sentence, and short sentences are easier and thus should be learned earlier.

The heuristics in these problems turn out to be beneficial. However, the heuristic curriculum design may lead to inconsistency between the fixed curriculum and the dynamically learned models. That is, the curriculum is predetermined a priori and cannot be adjusted to take into account the feedback about the learner.

Self-paced Learning

To alleviate the issue of CL, Koller’s group (Kumar, Packer, and Koller 2010) designed a new formulation, called self-paced learning (SPL). SPL embeds curriculum design as a regularization term into the learning objective. Compared with CL, SPL exhibits two advantages: first, it jointly optimizes the learning objective together with the curriculum, and therefore the curriculum and the learned model are consistent under the same optimization problem; second, the regularization term is independent of loss functions of specific problems. This theory has been successfully applied to various applications, such as action/event detection (Jiang et al. 2014b), reranking (Jiang et al. 2014a), domain adaptation (Tang et al. 2012), dictionary learning (Tang, Yang, and Gao 2012), tracking (Supančič III and Ramanan 2013) and segmentation (Kumar et al. 2011).

Formally, let $L(y_i, g(x_i, w))$ denote the loss function which calculates the cost between the ground truth label $y_i$ and the estimated label $g(x_i, w)$. Here $w$ represents the model parameter inside the decision function $g$.
In SPL, the goal is to jointly learn the model parameter $w$ and the latent weight variable $v = [v_1, \cdots, v_n]^T$ by minimizing:

\[ \min_{w,\, v \in [0,1]^n} E(w, v; \lambda) = \sum_{i=1}^{n} v_i L(y_i, g(x_i, w)) - \lambda \sum_{i=1}^{n} v_i, \tag{1} \]

where λ is a parameter for controlling the learning pace. Eq. (1) indicates that the loss of a sample is discounted by a weight. The objective of SPL is to minimize the weighted training loss together with the negative $l_1$-norm regularizer $-\|v\|_1 = -\sum_{i=1}^{n} v_i$ (since $v_i \ge 0$). A more general regularizer consists of both $\|v\|_1$ and $\|v\|_{2,1}$ (Jiang et al. 2014b).

ACS (Alternate Convex Search) is generally used to solve Eq. (1) (Gorski, Pfeuffer, and Klamroth 2007). It is an iterative method for biconvex optimization, in which the variables are divided into two disjoint blocks. In each iteration, one block of variables is optimized while the other block is kept fixed. With $w$ fixed, the global optimum $v^* = [v_1^*, \cdots, v_n^*]$ can be easily calculated by:

\[ v_i^* = \begin{cases} 1, & L(y_i, g(x_i, w)) < \lambda, \\ 0, & \text{otherwise.} \end{cases} \tag{2} \]

There is an intuitive explanation behind this alternating search strategy: first, when updating $v$ with a fixed $w$, a sample whose loss is smaller than a certain threshold λ is taken as an “easy” sample and will be selected in training ($v_i^* = 1$), or otherwise unselected ($v_i^* = 0$); second, when updating $w$ with a fixed $v$, the classifier is trained only on the selected “easy” samples. The parameter λ controls the pace at which the model learns new samples, and physically λ corresponds to the “age” of the model. When λ is small, only “easy” samples with small losses will be considered. As λ grows, more samples with larger losses will be gradually appended to train a more “mature” model.

This strategy complies with the heuristics in most CL methods (Bengio et al. 2009; Khan, Zhu, and Mutlu 2011). However, since the learning is completely dominated by the training loss, the learning may be prone to overfitting.
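The alternation between Eq. (2) and the weighted model fit can be illustrated with a small sketch. It is not the implementation of Kumar, Packer, and Koller; it assumes a squared loss with a linear model, a multiplicative growth factor for λ, and a toy dataset with five gross outliers, all of which are choices made here for illustration (the names `update_v`, `update_w`, `self_paced_learning`, `growth` and `outlier_idx` are hypothetical).

```python
import numpy as np

def update_v(losses, lam):
    """Eq. (2): v_i = 1 if the sample loss is below lam, otherwise 0."""
    return (losses < lam).astype(float)

def update_w(X, y, v):
    """w-step for fixed v: weighted least squares (squared loss is an assumption)."""
    sw = np.sqrt(v)
    w, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return w

def self_paced_learning(X, y, lam, growth, n_iters):
    """Alternate between the two blocks while the model 'age' lam grows."""
    v = np.ones(len(y))                  # arbitrary start: all samples selected
    for _ in range(n_iters):
        w = update_w(X, y, v)            # train on the currently selected samples
        losses = (y - X @ w) ** 2        # per-sample loss under the new model
        v = update_v(losses, lam)        # re-select the "easy" samples
        lam *= growth                    # hypothetical pace schedule
    return w, v

# Toy data: y = 1 + 2x plus small noise, with five gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(50)
outlier_idx = np.array([5, 15, 25, 35, 45])
y[outlier_idx] += 10.0

w, v = self_paced_learning(X, y, lam=4.0, growth=1.5, n_iters=5)
print("estimated w:", w)
print("selected samples:", int(v.sum()))
```

In this toy run the outliers keep a loss far above λ and are never selected, so the fit is driven by the clean samples only. The threshold has to be chosen relative to the scale of the loss, which is one reason λ is usually tuned per problem.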
Moreover, it provides no way to incorporate prior guidance in learning. To the best of our knowledge, there have been no studies that incorporate prior knowledge into SPL, or that analyze the relation between CL and SPL.

Self-paced Curriculum Learning

Model and Algorithm

An ideal learning paradigm should consider both prior knowledge known before training and information learned during training in a unified and sound framework. Similar to human education, we are interested in constructing an “instructor-student collaborative” paradigm, which, on one hand, utilizes prior knowledge provided by instructors as guidance for curriculum design (the underlying CL methodology), and, on the other hand, leaves students certain freedom to adjust the actual curriculum according to their learning paces (the underlying SPL methodology). This requirement can be realized through the following optimization model. As in CL, we assume that the model is given a curriculum that is predetermined by an oracle. Following the notation defined above, we have:

\[ \min_{w,\, v \in [0,1]^n} E(w, v; \lambda, \Psi) = \sum_{i=1}^{n} v_i L(y_i, g(x_i, w)) + f(v; \lambda) \quad \text{s.t. } v \in \Psi, \tag{3} \]

where $v = [v_1, v_2, \cdots, v_n]^T$ denotes the weight variables reflecting the samples’ importance; $f$ is called the self-paced function, which controls the learning scheme; Ψ is a feasible region that encodes the information of a predetermined curriculum. A curriculum can be mathematically described as:

Definition 1 (Total order curriculum) For training samples $X = \{x_i\}_{i=1}^{n}$, a total order curriculum, or curriculum for short, can be expressed as a ranking function:

\[ \gamma : X \to \{1, 2, \cdots, n\}, \]

where $\gamma(x_i) < \gamma(x_j)$ represents that $x_i$ should be learned earlier than $x_j$ in training. $\gamma(x_i) = \gamma(x_j)$ denotes that there is no preferred learning order on the two samples.

Definition 2 (Curriculum region) Given a predetermined curriculum γ(·) on training samples $X = \{x_i\}_{i=1}^{n}$ and their weight variables $v = [v_1, \cdots, v_n]^T$.
A feasible region Ψ is called a curriculum region of γ if
1. Ψ is a nonempty convex set;
2. for any pair of samples $x_i, x_j$, if $\gamma(x_i) < \gamma(x_j)$, it holds that $\int_{\Psi} v_i \, dv > \int_{\Psi} v_j \, dv$, where $\int_{\Psi} v_i \, dv$ calculates the expectation of $v_i$ within Ψ. Similarly, if $\gamma(x_i) = \gamma(x_j)$, then $\int_{\Psi} v_i \, dv = \int_{\Psi} v_j \, dv$.

The two conditions in Definition 2 offer a realization for curriculum learning. Condition 1 ensures the soundness for calculating the constraints. Condition 2 indicates that samples to be learned earlier should have larger expected values. The curriculum region physically corresponds to a convex region in the high-dimensional space. The area inside this region confines the space for learning the weight variables. The shape of the region weakly implies a prior learning sequence of samples, where the expected values for favored samples are larger. For example, Figure 1(b) illustrates a feasible region in 3D where the x, y and z axes represent the weight variables $v_1$, $v_2$ and $v_3$, respectively. Without considering the learning objective, we can see that $v_1$ tends to be learned earlier than $v_2$ and $v_3$. This is because if we uniformly sample sufficiently many points $(v_1, v_2, v_3)$ in the feasible region, the expected value of $v_1$ is larger. Since prior knowledge is missing in Eq. (1), the feasible region is a unit hypercube, i.e., all samples are equally favored, as shown in Figure 1(a). Note that the curriculum region should be confined within the unit hypercube because of the constraint $v \in [0, 1]^n$ in Eq. (3).

Figure 1: Comparison of feasible regions in SPL and SPCL. (a) SPL; (b) SPCL.

Note that the prior learning sequence in the curriculum region only weakly affects the actual learning sequence, and it is very likely that the prior sequence will be adjusted by the learners. This is because the prior knowledge determines a weak ordering of samples that suggests what should be learned first.
A learner takes this knowledge into account, but has his/her own freedom to alter the sequence in order to adjust to the learning objective. See an example in the supplementary materials. Therefore, SPCL represents an “instructor-student-collaborative” learning paradigm.

Compared with Eq. (1), SPCL generalizes SPL by introducing a regularization term. This term determines the learning scheme, i.e., the strategy used by the model to learn new samples. In human learning, we tend to use different schemes for different tasks. Similarly, SPCL should also be able to utilize different learning schemes for different problems. Since the existing methods only include a single learning scheme, we generalize the learning scheme and define:

Definition 3 (Self-paced function) A self-paced function determines a learning scheme. Suppose that $v = [v_1, \cdots, v_n]^T$ denotes a vector of weight variables, one for each training sample, and $\ell = [\ell_1, \cdots, \ell_n]^T$ are the corresponding losses. λ controls the learning pace (or model “age”). $f(v; \lambda)$ is called a self-paced function, if
1. $f(v; \lambda)$ is convex with respect to $v \in [0, 1]^n$.
2. When all variables are fixed except for $v_i$ and $\ell_i$, $v_i^*$ decreases with $\ell_i$, and it holds that $\lim_{\ell_i \to 0} v_i^* = 1$, $\lim_{\ell_i \to \infty} v_i^* = 0$.
3. $\|v\|_1 = \sum_{i=1}^{n} v_i$ increases with respect to λ, and it holds that $\forall i \in [1, n]$, $\lim_{\lambda \to 0} v_i^* = 0$, $\lim_{\lambda \to \infty} v_i^* = 1$.
Here $v^* = \arg\min_{v \in [0,1]^n} \sum_i v_i \ell_i + f(v; \lambda)$, and we denote $v^* = [v_1^*, \cdots, v_n^*]$.

The three conditions in Definition 3 provide a definition for the self-paced learning scheme. Condition 2 indicates that the model is inclined to select easy samples (with smaller losses) in favor of complex samples (with larger losses).

Table 1: Comparison of different learning approaches.
| | CL | SPL | Proposed SPCL |
| Comparable to human learning | Instructor-driven | Student-driven | Instructor-student collaborative |
| Curriculum design | Prior knowledge | Learning objective | Learning objective + prior knowledge |
| Learning schemes | Multiple | Single | Multiple |
| Iterative training | Heuristic approach | Gradient-based | Gradient-based |

Condition 3 states that when the model “age” λ gets larger, it should incorporate more, probably complex, samples to train a “mature” model. The convexity in Condition 1 ensures the model can find good solutions within the curriculum region.

It is easy to verify that the regularization term in Eq. (1) satisfies Definition 3. In fact, this term corresponds to a binary learning scheme since $v_i$ can only take binary values, as shown in the closed-form solution of Eq. (2). This scheme may be less appropriate in problems where the importance of samples needs to be discriminated. In fact, there exists a plethora of self-paced functions corresponding to various learning schemes. We will detail some of them in the next section.

Inspired by the algorithm in (Kumar, Packer, and Koller 2010), we propose a similar ACS algorithm to solve Eq. (3). Algorithm 1 takes as input a predetermined curriculum, an instantiated self-paced function and a stepsize parameter; it outputs an optimal model parameter $w$. First of all, it represents the input curriculum as a curriculum region that follows Definition 2, and initializes variables in their feasible region. Then it alternates between two steps until it finally converges: Step 4 learns the optimal model parameter with the fixed and most recent $v^*$; Step 5 learns the optimal weight variables with the fixed $w^*$. In the first several iterations, the model “age” is increased so that more complex samples will be gradually incorporated in the training. For example, we can increase λ so that µ more samples will be added in the next iteration.
According to the conditions in Definition 3, the number of complex samples increases as the number of iterations grows. Step 4 can be conveniently implemented by existing off-the-shelf supervised learning methods. Gradient-based or interior-point methods can be used to solve the convex optimization problem in Step 5. According to (Gorski, Pfeuffer, and Klamroth 2007), the alternating search in Algorithm 1 converges, as the objective function is monotonically decreasing and is bounded from below.

Relationship to CL and SPL

SPCL represents a general learning framework which includes CL and SPL as special cases. SPCL degenerates to SPL when the curriculum region is ignored ($\Psi = [0, 1]^n$), or equivalently, when the prior knowledge on predefined curriculums is absent. In this case, the learning is totally driven by the learner. SPCL degenerates to CL when the curriculum region (feasible region) only contains the learning sequence in the predetermined curriculum. In this case, the learning process neglects the feedback about learners, and is dominated by the given prior knowledge. When information from both sources is available, the learning in SPCL is collaboratively driven by prior knowledge and the learning objective. Table 1 summarizes the characteristics of different learning methods. Given reasonable prior knowledge, SPCL, which considers the information from both sources, tends to yield better solutions. The toy example in the supplementary materials lists a case in this regard.

Algorithm 1: Self-paced Curriculum Learning.
input : Input dataset D, predetermined curriculum γ, self-paced function f and a stepsize µ
output: Model parameter w
1 Derive the curriculum region Ψ from γ;
2 Initialize v∗, λ in the curriculum region;
3 while not converged do
4     Update w∗ = arg min_w E(w, v∗; λ, Ψ);
5     Update v∗ = arg min_v E(w∗, v; λ, Ψ);
6     if λ is small then increase λ by the stepsize µ;
7 end
8 return w∗
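A minimal sketch of the alternation in Algorithm 1 follows, under several assumptions that are ours rather than the paper's: the loss is a squared loss with a linear model, the self-paced function is the binary regularizer $-\lambda \sum_i v_i$ of Eq. (1), and the curriculum region is taken, purely for illustration, to be $\{v : a^T v \le c\} \cap [0,1]^n$ with positive coefficients $a$. Under these assumptions Step 5 is a linear program of fractional-knapsack form and can be solved greedily. The names `update_w`, `update_v`, `spcl`, `lam_max`, `flagged` and the toy data are hypothetical, and the stepsize `mu` here increments λ directly, which simplifies the "µ more samples" rule described in the text.

```python
import numpy as np

def update_w(X, y, v):
    """Step 4: weighted least squares for fixed weights v (assumed loss/model)."""
    sw = np.sqrt(v)
    w, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return w

def update_v(losses, lam, a, c):
    """Step 5 for f(v; lam) = -lam * sum(v) on {a^T v <= c} within [0, 1]^n.

    Minimising sum_i v_i * (loss_i - lam) under one budget constraint with
    positive a is a fractional knapsack, so a greedy pass is optimal.
    """
    gain = lam - losses                          # benefit of selecting sample i
    v = np.zeros_like(losses, dtype=float)
    budget = float(c)
    for i in np.argsort(-gain / a, kind="stable"):   # best gain per unit budget first
        if gain[i] <= 0:
            break                                # remaining samples only add cost
        if a[i] <= budget:
            v[i] = 1.0
            budget -= a[i]
        else:
            v[i] = budget / a[i]                 # spend what is left, then stop
            break
    return v

def spcl(X, y, a, c, lam, mu, lam_max, n_iters=10):
    v = update_v(np.zeros(len(y)), lam, a, c)    # Step 2: start inside the region
    for _ in range(n_iters):
        w = update_w(X, y, v)                    # Step 4
        losses = (y - X @ w) ** 2
        v = update_v(losses, lam, a, c)          # Step 5
        if lam < lam_max:                        # Step 6: "if lam is small"
            lam += mu
    return w, v

# Toy data: y = 1 + 2x with five gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(50)
outlier_idx = np.array([5, 15, 25, 35, 45])
y[outlier_idx] += 10.0

# Hypothetical prior: the instructor flags the outliers and three clean samples
# as "hard" by giving them a larger coefficient (learned later).
flagged = np.concatenate([outlier_idx, [0, 1, 2]])
a = np.ones(50)
a[flagged] = 10.0
c = 62.0                                         # room for 42 trusted + 2 flagged

w, v = spcl(X, y, a, c, lam=4.0, mu=4.0, lam_max=20.0)
print("estimated w:", w)
print("total weight:", v.sum(), "budget used:", a @ v)
```

In this toy setting the prior alone would treat the three mislabelled clean samples like outliers, and the loss alone would ignore the instructor entirely. The combined step admits the trusted samples first, spends the remaining budget on flagged samples with small losses, and leaves the true outliers at zero weight. With a very large `c` the constraint is inactive and the sketch reduces to the SPL update of Eq. (2).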
SPCL Implementation

The definition and algorithm in the previous section provide a theoretical foundation for SPCL. However, we still need concrete self-paced functions and curriculum regions to solve specific problems. To this end, this section discusses some implementations that follow Definition 2 and Definition 3. Note that there is no single implementation that always works best for all problems. As a pilot work on this topic, our purpose is to augment the implementations in the literature, and to help enlighten others to further explore this interesting direction.

Curriculum region implementation: We suggest an implementation induced from a linear constraint for realizing the curriculum region: $a^T v \le c$, where $v = [v_1, \cdots, v_n]^T$ are the weight variables in Eq. (3), $c$ is a constant, and $a = [a_1, \cdots, a_n]^T$ is an n-dimensional vector. The linear constraint is a simple implementation of the curriculum region that can be conveniently solved. It can be proved that this implementation complies with the definition of curriculum region. See the proof in the supplementary materials.

Theorem 1 For training samples $X = \{x_i\}_{i=1}^{n}$, given a curriculum γ defined on it, the feasible region defined by $\Psi = \{v \mid a^T v \le c\}$ is a curriculum region of γ if it holds that: 1) $\Psi \cap [0, 1]^n$ is nonempty; 2) $a_i < a_j$ for all $\gamma(x_i) < \gamma(x_j)$; $a_i = a_j$ for all $\gamma(x_i) = \gamma(x_j)$.

Self-paced function implementation: Similar to the schemes humans use to absorb knowledge, a self-paced function determines a learning scheme for the model to learn new samples. Note that the self-paced function is realized as a regularization term, which is independent of specific loss functions, and can be easily applied to various problems. Since humans tend to use different learning schemes for different tasks, SPCL should also be able to utilize different learning schemes for different problems. Inspired by a study in (Jiang et al. 2014a), this section discusses some examples of learning schemes.
Binary scheme: This scheme is used in (Kumar, Packer, and Koller 2010). It is called the binary scheme, or “hard” scheme, as it only yields binary weight variables.

\[ f(v; \lambda) = -\lambda \|v\|_1 = -\lambda \sum_{i=1}^{n} v_i. \tag{4} \]

Linear scheme: A common approach is to linearly discriminate samples with respect to their losses. This can be realized by the following self-paced function:

\[ f(v; \lambda) = \frac{1}{2} \lambda \sum_{i=1}^{n} (v_i^2 - 2 v_i), \tag{5} \]

in which $\lambda > 0$. This scheme represents a “soft” scheme as the weight variable can take real values.

Logarithmic scheme: A more conservative approach is to penalize the loss logarithmically, which can be achieved by the following function:

\[ f(v; \lambda) = \sum_{i=1}^{n} \left( \zeta v_i - \frac{\zeta^{v_i}}{\log \zeta} \right), \tag{6} \]

where $\zeta = 1 - \lambda$ and $0 < \lambda < 1$.

Mixture scheme: The mixture scheme is a hybrid of the “soft” and the “hard” scheme (Jiang et al. 2014a). If the loss is either too small or too large, the “hard” scheme is applied. Otherwise, the “soft” scheme is applied. Compared with the “soft” scheme, the mixture scheme tolerates small errors up to a certain point. To define this starting point, an additional parameter is introduced, i.e., $\lambda = [\lambda_1, \lambda_2]^T$. Formally,

\[ f(v; \lambda) = -\zeta \sum_{i=1}^{n} \log\left( v_i + \frac{1}{\lambda_1} \zeta \right), \tag{7} \]

where $\zeta = \frac{\lambda_1 \lambda_2}{\lambda_1 - \lambda_2}$ and $\lambda_1 > \lambda_2 > 0$.

Theorem 2 The binary, linear, logarithmic and mixture scheme functions are self-paced functions.

It can be proved that the above functions follow Definition 3. The name of each learning scheme suggests the characteristic of its solution. For example, denote $\ell_i = L(y_i, g(x_i, w))$. When $\Psi = [0, 1]^n$, setting the partial gradient of Eq. (3) under the logarithmic scheme to zero gives:

\[ \frac{\partial E_w}{\partial v_i} = \ell_i + (\zeta - \zeta^{v_i}) = 0, \tag{8} \]

where $E_w$ denotes the objective in Eq. (3) with $w$ fixed. We can then easily deduce:

\[ \log(\ell_i + \zeta) = v_i \log \zeta. \tag{9} \]

The optimal solution for $E_w$ is given by:

\[ v_i^* = \begin{cases} \dfrac{1}{\log \zeta} \log(\ell_i + \zeta), & \ell_i < \lambda, \\ 0, & \ell_i \ge \lambda. \end{cases} \tag{10} \]

As shown, the solution $v_i^*$ is logarithmic in its loss $\ell_i$. See the supplementary materials for the analysis of other self-paced functions.
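To make the schemes concrete, the sketch below evaluates the optimal weights on the unit hypercube ($\Psi = [0,1]^n$) for a few loss values. The binary and logarithmic cases follow Eq. (2) and Eq. (10). The linear and mixture closed forms are not stated in the text above; they are obtained here by the same stationarity argument applied to Eq. (5) and Eq. (7) and then clipped to $[0, 1]$, so they should be read as our derivation rather than a quotation. Function names are hypothetical.

```python
import numpy as np

def binary_weights(loss, lam):
    """Eq. (2): hard 0/1 selection."""
    return (loss < lam).astype(float)

def linear_weights(loss, lam):
    """Stationary point of v*loss + (lam/2)(v^2 - 2v): v = 1 - loss/lam, clipped."""
    return np.clip(1.0 - loss / lam, 0.0, 1.0)

def logarithmic_weights(loss, lam):
    """Eq. (10), with zeta = 1 - lam and 0 < lam < 1."""
    zeta = 1.0 - lam
    v = np.log(loss + zeta) / np.log(zeta)
    return np.where(loss < lam, v, 0.0)

def mixture_weights(loss, lam1, lam2):
    """Stationary point of v*loss - zeta*log(v + zeta/lam1), with lam1 > lam2 > 0.

    Gives v = zeta/loss - zeta/lam1 for lam2 < loss < lam1, v = 1 for small
    losses (hard region) and v = 0 for large losses.
    """
    zeta = lam1 * lam2 / (lam1 - lam2)
    safe = np.maximum(loss, lam2)            # avoid division by zero where v = 1
    soft = np.clip(zeta / safe - zeta / lam1, 0.0, 1.0)
    return np.where(loss <= lam2, 1.0, soft)

loss = np.array([0.0, 0.25, 0.5, 1.0])
print("binary     ", binary_weights(loss, 0.5))
print("linear     ", linear_weights(loss, 0.5))
print("logarithmic", logarithmic_weights(loss, 0.5))
print("mixture    ", mixture_weights(loss, 1.0, 0.25))
```

All four schemes assign weight 1 to a zero loss and weight 0 once the loss reaches the pace parameter; they differ in how quickly the weight decays in between, which is the sense in which the name of a scheme describes its solution.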
When the curriculum region is not a unit hypercube, a closed-form solution such as Eq. (10) cannot be directly used. Gradient-based methods can be applied instead. As $E_w$ is convex, a local optimum is also the global optimum of the subproblem.

Experiments

We present experimental results for the proposed SPCL on two tasks: matrix factorization and multimedia event detection. We demonstrate that our approach outperforms baseline methods on both tasks.

Matrix Factorization

Matrix factorization (MF) aims to factorize an $m \times n$ data matrix $Y$, whose entries are denoted as $y_{ij}$, into two smaller factors $U \in R^{m \times r}$ and $V \in R^{n \times r}$, where $r \ll \min(m, n)$, such that $UV^T$ is as close as possible to $Y$ (Chatzis 2014; Meng et al. 2013; Zhao et al. 2014). MF has many successful applications, such as structure from motion (Tomasi and Kanade 1992) and photometric stereo (Hayakawa 1994). Here we test the SPCL scheme on synthetic MF problems.

The data were generated as follows: two matrices $U$ and $V$, both of size $40 \times 4$, were first randomly generated with each entry drawn from the Gaussian distribution $N(0, 1)$, leading to a ground truth rank-4 matrix $Y_0 = UV^T$, and a certain amount of noise was then added to constitute the observation matrix $Y$. Specifically, uniform noise on $[-50, 50]$ was added to 20% of the entries, uniform noise on $[-40, 40]$ was added to another 20%, and Gaussian noise drawn from $N(0, 0.1^2)$ was added to the rest. We considered $L_2$- and $L_1$-norm MF methods, and incorporated the SPL and SPCL frameworks with the solvers proposed by Cabral et al. (2013) and Wang et al. (2012), respectively. The curriculum region was constructed by setting the coefficient vector $a$ and the constant $c$ in the linear constraint $a^T v \le c$ as follows. For $a$, first set $\tilde{a}_{ij} = 50$ for entries mixed with uniform noise on $[-50, 50]$, $\tilde{a}_{ij} = 40$ for entries mixed with uniform noise on $[-40, 40]$, and $\tilde{a}_{ij} = 1$ for the rest. Then $a$ was calculated by $a_{ij} = \tilde{a}_{ij} / \sum_{ij} \tilde{a}_{ij}$.
For $c$, we specified it as 0.02 and 0.01 for $L_2$- and $L_1$-norm MF, respectively. Two criteria were used for performance assessment: (1) root mean square error (RMSE): $\frac{1}{\sqrt{mn}} \| Y_0 - \hat{U}\hat{V}^T \|_F$, and (2) mean absolute error (MAE): $\frac{1}{mn} \| Y_0 - \hat{U}\hat{V}^T \|_1$, where $\hat{U}, \hat{V}$ denote the output of the MF method used. The performance of each method was evaluated as the average over 50 random realizations, as summarized in Table 2.

Table 2: Performance comparison of SPCL and baseline methods for matrix factorization.

| | L2-norm MF Baseline | L2-norm MF SPL | L2-norm MF SPCL | L1-norm MF Baseline | L1-norm MF SPL | L1-norm MF SPCL |
| RMSE | 9.3908 | 0.2585 | 0.0654 | 2.8671 | 0.1117 | 0.0798 |
| MAE | 6.8597 | 0.0947 | 0.0497 | 1.4729 | 0.0766 | 0.0607 |

The results show that the baseline methods fail to obtain a reasonable approximation to the ground truth matrices due to the large noise embedded in the data, while SPL and SPCL significantly improve the performance. Besides, SPCL outperforms SPL. This is because SPL is more sensitive to the starting values than SPCL, and tends to overfit to the noise. SPCL can alleviate this issue, as depicted in Figure 2, because it is constrained by the prior curriculum and can weight the noisy samples properly.

Figure 2: Comparison of the convergence of SPL and SPCL. (Two panels, L1-norm MF and L2-norm MF, each plotting RMSE against iteration for Baseline, SPL and SPCL.)

Multimedia Event Detection (MED)

Given a collection of videos, the goal of MED is to detect events of interest, e.g. “Birthday Party” and “Parade”, solely based on the video content. Since MED is a very challenging task, many studies have been proposed to tackle this problem in different settings, including training detectors using sufficient examples (Wang et al. 2013; Gkalelis and Mezaris 2014; Tong et al. 2014), using only a few examples (Safadi, Sahuguet, and Huet 2014; Jiang et al. 2014b), by exploiting semantic features (Tan, Jiang, and Neo 2014; Liu et al. 2013; Zhang et al.
2014; Inoue and Shinoda 2014; Jiang, Hauptmann, and Xiang 2012; Tang et al. 2012; Yu, Jiang, and Hauptmann 2014; Cao et al. 2013), and by automatic speech recognition (Miao, Metze, and Rawat 2013; Miao et al. 2014; Chiu and Rudnicky 2013). We applied SPCL in a reranking setting, in which zero examples are given. It aims at improving the ranking of the initial search result.

The TRECVID Multimedia Event Detection (MED) 2013 Development, MED13Test and MED14Test sets were used (Over et al. 2013), which include around 34,000 Internet videos. The performance was evaluated on the MED13Test and MED14Test sets (25,000 videos) by the Mean Average Precision (MAP). There were 20 prespecified events on each dataset. Six types of visual and acoustic features were used. More information about these features is in (Jiang et al. 2014c).

In CL, the curriculum was derived by MMPRF (Jiang et al. 2014c). In SPL, the curriculum was derived by the learning objective according to Eq. (1), where the loss is the hinge loss. In SPCL, Algorithm 1 was used, where Step 5 was solved by LM-BFGS (Zhu et al. 1997) in the “stats” package of the R language, and Step 4 was solved by a standard quadratic programming toolkit. The mixture scheme was used, and all parameters were carefully tuned on a validation set on a different set of events. The predetermined curriculum in MMPRF was encoded as linear constraints $Av \le g$ representing the prior knowledge on modality weighting presented in (Jiang et al. 2014c). The intuition is that some features are more discriminative than others, and the constraints emphasize these discriminative features.

As we see in Table 3, SPCL outperforms both CL and SPL. The improvement is statistically significant across the 20 events at the p-level of 0.05, according to the paired t-test.

Table 3: Performance comparison of SPCL and baseline methods for zero-example event reranking.

| Dataset | CL | SPL | SPCL |
| MED13Test | 10.1 | 10.8 | 12.9 |
| MED14Test | 7.3 | 8.6 | 9.2 |
For this problem, the “student-driven” learning mode (SPL) turns out to be better than the “instructor-driven” mode (CL). The “instructor-student-collaborative” learning mode exploits prior knowledge and improves on SPL. We hypothesize the reason is that SPCL takes advantage of the reliable prior knowledge and thus arrives at better solutions. The results substantiate the argument that learning with both prior knowledge and the learning objective tends to be beneficial.

Conclusions and Future Work

We proposed a novel learning regime called self-paced curriculum learning (SPCL), which imitates the learning regime of humans and animals by gradually involving training samples, from easy to more complex, in the learning process. The proposed SPCL can exploit both prior knowledge before training and dynamic information extracted during training. The novel regime is analogous to an “instructor-student-collaborative” learning mode, as opposed to “instructor-driven” in curriculum learning or “student-driven” in self-paced learning. We presented compelling understandings of curriculum learning and self-paced learning, and revealed that they can be unified into a concise optimization model. We discussed several concrete implementations in the proposed SPCL framework. Experimental results on two different tasks substantiate the advantage of SPCL. Empirically, we found that in some problems SPCL requires a validation set that follows the same underlying distribution as the test set for tuning parameters. Intuitively, this set is analogous to the mock exam in education, whose purpose is to let students realize how well they would perform on the real test and, importantly, to give them a better idea of what to study.

Future directions may include developing new learning schemes for different problems. Since humans tend to use different learning schemes to solve different problems, SPCL should utilize appropriate learning schemes for the various problems at hand.
In addition, as in curriculum learning, we currently assume that the curriculum is a total order. We plan to relax this assumption in future work.

Acknowledgments

This paper was partially supported by the U.S. Department of Defense, U.S. Army Research Office (W911NF-13-1-0277) and by the National Science Foundation under Grant No. IIS-1251187. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ARO, the National Science Foundation or the U.S. Government.

References

Basu, S., and Christensen, J. 2013. Teaching classification boundaries to humans. In AAAI.
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML.
Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on PAMI 35(8):1798–1828.
Bengio, Y. 2014. Evolving culture versus local minima. In Growing Adaptive Machines. Springer. 109–138.
Cabral, R.; De la Torre, F.; Costeira, J. P.; and Bernardino, A. 2013. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In ICCV.
Cao, L.; Gong, L.; Kender, J. R.; Codella, N. C.; and Smith, J. R. 2013. Learning by focusing: A new framework for concept recognition and feature selection. In ICME.
Chatzis, S. P. 2014. Dynamic Bayesian probabilistic matrix factorization. In AAAI.
Chiu, J., and Rudnicky, A. 2013. Using conversational word bursts in spoken term detection. In Interspeech.
Gkalelis, N., and Mezaris, V. 2014. Video event detection using generalized subclass discriminant analysis and linear support vector machines. In ICMR.
Gorski, J.; Pfeuffer, F.; and Klamroth, K. 2007.
Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research 66(3):373–407.
Hayakawa, H. 1994. Photometric stereo under a light source with arbitrary motion. Journal of the Optical Society of America A 11(11):3079–3089.
Inoue, N., and Shinoda, K. 2014. n-gram models for video semantic indexing. In MM.
Jiang, L.; Meng, D.; Mitamura, T.; and Hauptmann, A. G. 2014a. Easy samples first: Self-paced reranking for zero-example multimedia search. In MM.
Jiang, L.; Meng, D.; Yu, S.-I.; Lan, Z.; Shan, S.; and Hauptmann, A. G. 2014b. Self-paced learning with diversity. In NIPS.
Jiang, L.; Mitamura, T.; Yu, S.-I.; and Hauptmann, A. G. 2014c. Zero-example event search using multimodal pseudo relevance feedback. In ICMR.
Jiang, L.; Hauptmann, A. G.; and Xiang, G. 2012. Leveraging high-level and low-level features for multimedia event detection. In MM.
Khan, F.; Zhu, X.; and Mutlu, B. 2011. How do humans teach: On curriculum learning and teaching dimension. In NIPS.
Kumar, M.; Turki, H.; Preston, D.; and Koller, D. 2011. Learning specific-class segmentation from diverse data. In ICCV.
Kumar, M.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In NIPS.
Liu, J.; Yu, Q.; Javed, O.; Ali, S.; Tamrakar, A.; Divakaran, A.; Cheng, H.; and Sawhney, H. 2013. Video event recognition using concept attributes. In WACV.
Meng, D.; Xu, Z.; Zhang, L.; and Zhao, J. 2013. A cyclic weighted median method for l1 low-rank matrix factorization with missing entries. In AAAI.
Miao, Y.; Jiang, L.; Zhang, H.; and Metze, F. 2014. Improvements to speaker adaptive training of deep neural networks. In SLT.
Miao, Y.; Metze, F.; and Rawat, S. 2013. Deep maxout networks for low-resource speech recognition. In ASRU.
Over, P.; Awad, G.; Michel, M.; Fiscus, J.; Sanders, G.; Kraaij, W.; Smeaton, A. F.; and Quenot, G. 2013. TRECVID 2013 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID.
Safadi, B.; Sahuguet, M.; and Huet, B. 2014. When textual and visual information join forces for multimedia retrieval. In ICMR.
Spitkovsky, V. I.; Alshawi, H.; and Jurafsky, D. 2009. Baby steps: How less is more in unsupervised dependency parsing. In NIPS.
Supančič III, J., and Ramanan, D. 2013. Self-paced learning for long-term tracking. In CVPR.
Tan, S.; Jiang, Y.-G.; and Neo, C.-W. 2014. Placing videos on a semantic hierarchy for search result navigation. TOMCCAP 10(4).
Tang, K.; Ramanathan, V.; Li, F.; and Koller, D. 2012. Shifting weights: Adapting object detectors from image to video. In NIPS.
Tang, Y.; Yang, Y. B.; and Gao, Y. 2012. Self-paced dictionary learning for image classification. In MM.
Tomasi, C., and Kanade, T. 1992. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision 9(2):137–154.
Tong, W.; Yang, Y.; Jiang, L.; Yu, S.-I.; Lan, Z.; Ma, Z.; Sze, W.; Younessian, E.; and Hauptmann, A. G. 2014. E-LAMP: integration of innovative ideas for multimedia event detection. Machine Vision and Applications 25(1):5–15.
Wang, N.; Yao, T.; Wang, J.; and Yeung, D. 2012. A probabilistic approach to robust matrix factorization. In ECCV.
Wang, F.; Sun, Z.; Jiang, Y.; and Ngo, C. 2013. Video event detection using motion relativity and feature selection. IEEE Transactions on Multimedia.
Yu, S.-I.; Jiang, L.; and Hauptmann, A. 2014. Instructional videos for unsupervised harvesting and learning of action examples. In MM.
Zhang, H.; Yang, Y.; Luan, H.; Yang, S.; and Chua, T.-S. 2014. Start from scratch: Towards automatically identifying, modeling, and naming visual attributes. In MM.
Zhao, Q.; Meng, D.; Xu, Z.; Zuo, W.; and Zhang, L. 2014. Robust principal component analysis with complex noise. In ICML.
Zhu, C.; Byrd, R. H.; Lu, P.; and Nocedal, J. 1997. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software 23(4):550–560.