Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Transfer Learning Using Task-Level Features with Application to Information Retrieval Rong Yan IBM T. J. Watson Research Hawthorne, NY, USA Jian Zhang Purdue University West Lafayette, IN, USA yanr@us.ibm.com jianzhan@stat.purdue.edu Abstract We propose a probabilistic transfer learning model that uses task-level features to control the task mixture selection in a hierarchical Bayesian model. These task-level features, although rarely used in existing approaches, can provide additional information to model complex task distributions and allow effective transfer to new tasks especially when only limited number of data are available. To estimate the model parameters, we develop an empirical Bayes method based on variational approximation techniques. Our experiments on information retrieval show that the proposed model achieves significantly better performance compared with other transfer learning methods. 1 Introduction Recent years have seen a growing interest in developing algorithms to learn multiple related tasks and dynamically transfer existing models to new tasks. This is known as transfer learning [Thrun and Pratt, 1998], or multi-task learning [Caruana, 1997]. It has been empirically as well as theoretically shown that transfer learning can significantly improve the learning performance especially when only a small number of data are available for new tasks. Transfer learning has many practical applications, and one such example is information retrieval. In more details, information retrieval can be cast into binary classification that considers the relevant query-document pairs as positive data and irrelevant pairs as negative data. By viewing each query as a separate task, we can reformulate the retrieval task into a transfer learning problem, which aims to predict the relevance of each document for a new query based on the manual relevance judgment from the training queries. Many transfer learning methods assume that the model parameters are generated from a uni-modal distribution (such as normal) for all given tasks. However, such an assumption is very likely to be violated in practice. As a matter of fact, tasks in many domains tend to be grouped into a number of clusters, and different tasks can be associated with different clusters. In such cases, it is of great importance for the transfer learning algorithms to effectively capture task-cluster associations and route a new task to the correct cluster. Although a simple mixture model instead of a uni-modal distribution can capture the above multi-cluster situation, the goal is still difficult to achieve when each task only possesses a limited number of training examples. Even worse, a new task can appear without any training examples at all. Unfortunately, transferring to a new task with few examples is typical in real applications such as information retrieval. To approach this issue, we consider incorporating into transfer learning a set of features which can directly represent the properties of the task itself, called “task-level features”, in additional to the commonly-used data features generated from training examples. Task-level features can be extracted in a flexible way. In information retrieval, task features can be defined as the properties of user queries, such as whether the query contains person names or not. Similarly, task features can also be generated from user profiles in collaborative filtering task, and school-level information for predicting student exam scores [Bakker and Heskes, 2003]. Intuitively, these task features provide helpful evidence to pinpoint which cluster a given task belongs to, especially when only a limited number of data examples are available for individual tasks. But the values of task features have not been fully explored in the previous work of transfer learning. In this paper, we propose a probabilistic transfer learning(TL) model by introducing task-level features to a hierarchical Bayesian model for classification. We call it the TL-TF model in the rest of the paper. This model assumes the model parameters for different tasks are generated from a linear combination of basic logistic regression classifiers together with a small set of hidden sources. The distribution of hidden sources of each task is controlled by its task features, while the combination weights and the model parameters in basic classifiers are learned from the data examples. To estimate the parameters of TL-TF, we also develop an empirical Bayes method based on variational approximation techniques. When transferring to a new task, the weights of task mixtures can be directly derived from its task features, and thus the classification outputs can be computed without using any additional training data from the new task. As the first attempt to apply transfer learning with task features to information retrieval, our experiments show that the proposed model is much more effective than other TL methods that either discard the task features, or treat task features as another set of data features. 1315 2 Related Work In the literature, transfer learning has also been known as “multi-task learning” and “learning to learn”. Many methods have been proposed for transfer learning and multi-task learning, such as neural networks [Caruana, 1997],transformation methods [Breiman and Friedman, 1997], regularization learning methods [Evgeniou et al., 2005; Ando and Zhang, 2004], hierarchical Bayesian models [Heskes, 2000; Yu et al., 2005; Zhang et al., 2005] and etc. In particular, the proposed model is closely related to the work based on hierarchical Bayesian models, which provide a flexible yet compact representation of the structure in the data space. Thus, they allow models to fit existing data well and generalize well on unseen data. For example, [Heskes, 2000] presented a model for multi-task learning by assuming that response variables of each task follow a normal distribution. Empirical Bayes techniques are used to learn the model hyper-parameters. In [Yu et al., 2005], Gaussian processes are applied to model multiple related tasks and a single Gaussian component is used to capture the mean and covariance of all related tasks. [Zhang et al., 2005] further proposed a latent variable model for multi-task learning which assumes that tasks are sparse linear combination of basic classifiers. A more recent work [Xue et al., 2007] proposed a Dirichlet process based hierarchical Bayesian model for learning multiple related classifiers. However, none of above models take the task-level features into account in the learning process. By utilizing task features in a softmax gating function, [Bakker and Heskes, 2003] augmented the Bayesian neural network model to handle multi-task learning with real-valued outputs and additive noise. This approach has shown to be effective in multiple data collections. However, this study was mainly focusing on regression in a non-transfer setting. More recently, [Bonilla et al., 2007] presented a kernel multi-task regression method using Gaussian process with task features. In contrast, the proposed approach is a hierarchical Bayesian model with focus on classification problems under the transfer learning setting, which is more typical for applications such as information retrieval. Moreover, to our best knowledge, our work is the first attempt to apply transfer learning with task-level features to the information retrieval domain. 3 Transfer Learning with Task-Level Features In this section, we describe the details of the proposed TL-TF model by introducing mixture modeling and task-level features to hierarchical Bayesian models. First, let us assume there are K classification tasks. Each task is associated with a training data set D(k) = {X(k) , y(k) }, where the training (k) (k) data are X(k) = [x1 , . . . , xnk ]T ∈ Rnk ×F and the labels (k) (k) are y(k) = [y1 , . . . , ynk ]T . y(k) ∈ {0, 1}nk ×1 for the classification tasks and y(k) ∈ Rnk ×1 for the regression tasks. Furthermore, for each task we are able to obtain a set of tasklevel features t(k) ∈ RP ×1 that describe the task properties. These features can be automatically extracted from the task description without knowing the data examples, such as the number of persons mentioned in queries in information retrieval and the age of users in collaborative filtering. One building block for transfer learning is to define the learning models for each individual task. In this paper, we consider using the parametric model f (k) (x) = f (k) (x|θ (k) ) to generate the data labels for each task, where the parameter θ(k) is used to index the prediction function f (k) .1 Given the parameter θ (k) , we can derive the following likelihood models for classification, (k) yi (k) ∼ Bernoulli(g(θ(k) , xi )) (1) −1 where g(t) = (1 + exp(−t)) is the logistic function and x, y is used to denote the inner product between x and y. This model becomes logistic regression if the model parameters θ(k) are estimated independently on each single task. In order to learn multiple related tasks more effectively, we can transform individual task learning process into a joint learning problem, which explains the relatedness of multiple tasks through some hidden sources s(k) . To handle the multi-cluster task distribution, we assume the hidden source variable s(k) is sampled from a multinomial distribution and use it to indicate which mixture component the task belongs to. Given that a set of task-level features t(k) are available for each task, we further assume the distribution parameters of sk are determined by its task features. Then a generative model can be constructed for θ(k) ’s in such a way that not only task dependency structure can be captured and jointly modeled, but also the information contained in t(k) can be effectively utilized. Formally, we can have the following hierarchical Bayesian model for generating θ(k) ’s: θ (k) s(k) = ∼ Λs(k) + e(k) Multinomial (p1 , . . . , pH ) e(k) ∼ Normal (0, Ψ) ph = exp(γ Th t(k) ) T (k) ) h exp(γ h t (2) where θ (k) ∈ RF ×1 are the model parameters for the k th task, t(k) ∈ RP ×1 are the task features, s(k) ∈ RH×1 are the hidden source vector2 where H is the number of task clusters and γ h denotes the distribution parameters for the softmax function ph , Λ ∈ RF ×H is the linear transformation matrix and e(k) is a multi-variate Gaussian noise vector with a covariance matrix Ψ ∈ RF ×F . To further simplify derivation, we use Γ = [γ 1 , . . . , γ H ] ∈ RP ×H to denote the parameter matrix of the multi-class logistic regression where γ h is its hth column, and similar notation is used for Λ = [λ1 , . . . , λH ] where λh is the h-th column of Λ. Also note that due to the properties of multi-class logistic regression [Hastie et al., 2001] we can fix γ 1 to be the all zero vector 0. The probabilistic model is complete by combining Eqn (1) and (2). From the model definition, we can find that: 1. θ(k) Note that it is not necessary to require all f (k) ’s belonging to the same parametric family, however, parameters θ (1) , . . . , θ (K) ∈ Θ are assumed to belong to the same metric space. 2 The vector s(k) takes the forms of [0, . . . , 0, 1, 0, . . . , 0] where only one element is 1 and the rest are 0’s. We also use z (k) ∈ {1, . . . , H} instead of s(k) to denote the 1’s index for simplicity. 1316 1 is assumed to be normally distributed with the same covariance Ψ. Its mean is a linear combination of columns of Λ shared by all K tasks; 2. The combination weights for Λ are determined by the latent variable s(k) sampled from a multinomial distribution; 3. s(k) is controlled by the task-level features t(k) through a multi-class logistic regression model, where in some sense t(k) serves as a gating function to adapt the distribution of hidden sources task by task. Our model has connections to several previous approaches. For instance, in the case when H = 1 and without considering the task-level features, the proposed model is closely related to the multi-task linear models proposed in [Yu et al., 2005]. Furthermore, when H > 1 it is closely related to the latent independent component analysis (LICA) model proposed in [Zhang et al., 2005] which assumes the latent variable s(k) to follow a sparse prior distribution. However, in contrast to existing approaches, the proposed model distinguishes itself from the fact that task-level features t(k) are naturally incorporated into the generation process of latent variables s(k) . This allows us to model the task-mixture association more accurately and predict the mixture assignment for a new task even when few data examples are provided. As a next step, we need to transfer the parameters of the learned model to a new task (e.g., a new query or a new user) with a limited number of training data or even no training data. We are interested in investigating whether the learning of a new task can benefit from generalizing the previous task parameters and whether the task features can be helpful to provide more accurate predictions. In this case, it is key to develop a generative model, e.g. we have to make explicit assumptions about how tasks are related. We can observe from our generative model that given the learned parameters Γ, Λ and Ψ from previous K tasks, we can naturally extend the generation process for the (K + 1)-th task to be θ(K+1) ∼ Normal(Λs(K+1) , Ψ) s(K+1) ∼ Multinomial(p1 , . . . , pH ), ph = exp(γ Th t(k+1) ) , T (k+1) ) h exp(γ h t where t(k+1) is the task feature for the new task. Finally, for a given input data vector x, its prediction is given by p(y|x) = H ph p(θ (K+1) |λh , Ψ)p(y|x, θ(K+1) )dθ (K+1) . h=1 If we want to reduce the computational complexity in the prediction step, an alternative is to use the MAP estimation of θ(K+1) to avoid the computation of the integral, where the prediction function can be rewritten as, p(y|x) = H 4 Learning and Inference We present the learning and inference algorithms for TL-TF in this section. Given the model definition of TL-TF, we need to estimate the parameters Λ, Γ and Ψ. Here we take the empirical Bayes approach by integrating out the random variables s(k) ’s and θ (k) ’s. Thus, the log-likelihood of the parameters Ω = {Λ, Γ, Ψ} for observed data {X(k) , y(k) , t(k) }K k=1 is “ ” log p y(1) , . . . , y(K) | Ω, X(1) , t(1) , . . . , X(K) , t(K) = K X log k=1 Z × X p(s(k) |Γ, t(k) ) s p(θ (k) |Λ, s(k) , Ψ) nk Y (k) (k) p(yi |θ (k) , xi )dθ (k) , i=1 where p(s(k) |Γ, t(k) ) is the multinomial distribution of the hidden sources, p(θ(k) |Λ, s(k) ) is a normal distribution with (k) (k) mean λz(k) and covariance matrix Ψ, and p(yi |θ(k) , xi ) is the likelihood function of classification in Eqn (1). Such an estimation problem can be solved by the EM algorithm, of which the detailed derivations are given as follows. To be more specific, the goal of learning is to estimate the parameters Ω by maximizing the log-likelihood over all K tasks. Since the log-likelihood function involves two set of hidden variables, i.e., s(k) ’s and θ(k) ’s, EM is applied to iteratively solve a series of simpler problems. E-step: Given the parameters Ω all tasks are decoupled, the E-step can be conducted for each task separately. Thus we only need to consider one task per time and we can omit the superscript (k) for simplicity. But because it is generally intractable to do an exact inference for classification problems, we decided to apply variational methods as one type of approximate inference techniques to optimize the objective function. The basic idea of variational methods is to use a tractable family of distributions q(θ, s) to approximate the true posterior distribution. Specifically we assume an auxiliary distribution q(θ, s) = q1 (s)q2 (θ), e.g. the mean field approximation, as a surrogate to approximate the true posterior distribution p(θ, s|Ω, X, t, y). Furthermore, we assume that q1 (s) = q1 (s|Φ) has the form of multinomial distribution with parameters φ = H [φ1 , . . . , φH ]T such that φh ≥ 0 and h=1 φh = 1. q2 (θ) = q2 (θ|m, V) has the form of a multivariate normal with mean m and covariance matrix V. We need to find the best set of variational parameters φ, m and V such that the following lower bound is maximized, which is equivalent to minimize the KL divergence between q1 (s)q2 (θ) and p(θ, s|Ω, X, t, y): ≥ ph p(y|x, λh ). log p(y|Ω, X, t) X q1 (s|φ)Eq2 [log p(y, θ, s|Ω, X, t)] + H(s) + H(θ) s h=1 = This formula can be efficiently computed and thus we adopt it in our following experiments. 1317 H X φh {E[log p(z = h|Ω, t) + E[log p(θ|λh , Ψ)]} h=1 +E[log p(y|θ, X)] + H(s) + H(θ) where H(θ) = − q2 (θ) log q2 (θ)dθ is the entropy of θ, H(s) = − φh log φh is the entropy of s, and the expected values of the first two terms on the right hand side are E [p(θ|λ.h , Ψ)] = E [log p(s = h|Ω, t)] = (a) (b) (c) (d) n „ X yi m T x i − ξ i log g(ξi ) + 2 i=1 “ ”” +h(ξi ) xTi (V + mmT )xi − ξi2 ≥ where h(t) = (1/2 − g(t))/(2t), g(t) is the logistic function and n is the number of data for the task. By substituting the additional lower bound back to Eqn(3), we are able to optimize the lower bound of the log-likelihood function with respect to V, m, φ to obtain the following E-step: ξi V = = [xTi (V + mmT )xi ]1/2 Ψ−1 − 2 n X !−1 g(ξi )xi xTi (3) (4) i=1 m = φh ∝ ! H n X 1X −1 yi x i + Ψ φh λ h V 2 i=1 h=1 „ « 1 T T −1 exp γ h t − (m − λh ) Ψ (m − λh ) 2 (5) (6) These fixed point equations should be repeated over ξi ’s, m, V and φh ’s until the lower bound is maximized. Upon convergence, we can use the resulting q1 (s|φ)q2 (θ|m, V) as a surrogate to the true posterior probability p(s, θ|Ω, X, t, y). M-step: Given the sufficient statistics obtained in the Estep, the M-step can be derived similarly by optimizing the lower bound of log-likelihood with respect to the model parameters, i.e., Γ,8Λ, Ψ: “ ” 39 2 K X H <X = exp γ Th t(k) (k) ´5 ` T φh log 4 P Γ̂ = arg min (k) ; Γ :k=1 h=1 h exp γ h t # "P P (k) (k) K K (k) (k) k=1 φ1 m k=1 φH m (7) Λ̂ = ,..., P PK (k) (k) K k=1 φ1 k=1 φH ! K H X 1 X (k) V(k) + Ψ̂ = φh (m(k) − λh )(m(k) − λh )T K k=1 1. Initialize parameters Λ, Γ and Ψ. 2. E-step: For the k-th task (k = 1, . . . , K): ˜ 1 ˆ 1 − log |2πΨ| − Tr Ψ−1 V 2 2 1 − (m − λh )T Ψ−1 (m − λh ) 2 exp(γ Th t) . log P T h exp(γ h t) However, the derivation is not complete yet because we cannot derive a closed-form representation for the term E[log p(y|θ, X)], where p(y|θ, X) = i p(yi |θ, xi ) is a product of logistic likelihood functions. So we resort to another variational technique proposed in [Jaakkola and Jordan, 1997] to compute its lower bound as a function of m and V by introducing a new variational parameters ξi for the ith example for the given task: E[log p(y|θ, X)] Algorithm 1 Empirical Bayes method for classification tasks 3. M-step: Update parameters according to Eqn(7) 4. Continue steps 2-3 until convergence. 5 Experiments: Information Retrieval In this section, we examine the effectiveness of the proposed model for an information retrieval task on large-scale video collections. We compared the proposed TL-TF model with several baseline models, including a single task learning (STL) model by directly estimating the likelihood in Eqn (1), a transfer learning model with a single cluster (TL-SC) and a TL model with multiple clusters that does not utilize any task features(TL-MC). To compare in a fair manner, we also evaluate another multi-cluster TL model that combines task features with the set of data features without using a taskdependent gating function in Eqn(2) (TL-Comb). 5.1 Experimental Setting We evaluated our learning algorithms using the video collections officially provided by the TREC video retrieval evaluation(TRECVID) 2002-2005 [Smeaton and Over, 2003]. Video shots are used as the retrieval unit. For each query, average precision is adopted as a measure of retrieval effectiveness. In more details, let R be the number of true relevant documents in a set of size S. At any given index j let Rj be the number of relevant documents in the top j documents. Let Ij = 1 if the j th document is relevant and 0 otherwise. The S R average precision (AP) is then defined as R1 j=1 jj ∗ Ij . Mean average precision (Mean-AP) is the mean of average precision for all queries in the collection. We split the four video collections, i.e., TREC’02-’05, into a development set to learn model parameters, and a search set to evaluate retrieval performance. For each search set, NIST officially provided 25 query topics, including both text description and image/video examples, together with their relevance judgment. On the other hand, the four development collections are combined into a single training corpus. Several human annotators collaboratively defined 88 query topics – which are different from the testing queries – and collected Data Set Query Num Doc Num h=1 In case we want to reduce the number of parameters we can assume that Ψ is diagonal with isotropic variance, e.g. Ψ = τ 2 I, and we have τ̂ 2 = Tr(Ψ̂)/F . The EM learning process is summarized in Algorithm 1. (k) update ξi (i = 1, . . . , nk ) based on Eqn(3) update V(k) and m(k) based on Eqn(4) and Eqn(5) update φ(k) based on Eqn(6) continue (a)-(c) until convergence t02 25 24263 t03 25 75850 t04 24 48818 t05 24 77979 dev 88 124097 Table 1: Labels of the video collections and statistics. t ∗ ∗ indicate the TRECVID testing sets, where the embedded number is the year, and dev is the development set. 1318 Data t02 t03 t04 t05 Method STL TL-SC(H=1) TL-MC(H=6) TL-Comb(H=6) TL-TF(H=6) STL TL-SC(H=1) TL-MC(H=6)* TL-Comb(H=6)* TL-TF(H=6)* STL TL-SC(H=1) TL-MC(H=6) TL-Comb(H=6) TL-TF(H=6)* STL TL-SC(H=1) TL-MC(H=6) TL-Comb(H=6) TL-TF(H=6) Mean-AP 0.114(+0%) 0.119(+4%) 0.123(+8%) 0.124(+9%) 0.136(+19%) 0.150(+0%) 0.177(+18%) 0.192(+28%) 0.190(+27%) 0.203(+35%) 0.079(+0%) 0.089(+12%) 0.093(+17%) 0.094(+17%) 0.109(+39%) 0.095(+0%) 0.095(+0%) 0.097(+2%) 0.099(+4%) 0.118(+21%) P30 0.135 0.123 0.132 0.132 0.138 0.212 0.224 0.228 0.227 0.243 0.177 0.186 0.185 0.186 0.190 0.238 0.242 0.240 0.242 0.243 P100 0.081 0.081 0.083 0.084 0.082 0.136 0.134 0.140 0.139 0.136 0.116 0.114 0.110 0.111 0.123 0.205 0.198 0.195 0.194 0.207 Person 0.199 0.290 0.293 0.293 0.387 0.251 0.366 0.428 0.429 0.460 0.144 0.178 0.185 0.186 0.252 0.164 0.159 0.139 0.141 0.180 SObj 0.205 0.183 0.206 0.206 0.208 0.293 0.314 0.350 0.350 0.345 0.063 0.067 0.036 0.037 0.069 0.029 0.021 0.021 0.020 0.047 GObj 0.074 0.062 0.059 0.059 0.049 0.095 0.103 0.081 0.080 0.096 0.034 0.037 0.038 0.037 0.045 0.090 0.091 0.075 0.077 0.072 Sport 0.007 0.017 0.009 0.009 0.006 0.101 0.070 0.102 0.100 0.109 0.108 0.085 0.100 0.099 0.111 0.271 0.200 0.293 0.294 0.329 Other 0.019 0.018 0.020 0.021 0.026 0.012 0.012 0.013 0.013 0.014 0.051 0.061 0.035 0.035 0.047 0.017 0.019 0.016 0.016 0.015 Table 2: Comparison of retrieval performance between STL, TL-SC, TL-MC, TL-Comb and TL-TF on multiple testing sets. All parameters are learned from the development set with the 88 training queries. * means statistical significance over STL with p-value < 0.01 (sign tests). their relevance judgment on the development set. Table 1 lists the labels of video collections and their query/document statistics. As building blocks for information retrieval, we generated a number of ranking features on each video document, including 14 high-level semantic features learned from development data (face, anchor, commercial, studio, graphics, weather, sports, outdoor, person, crowd, road, car, building, motion), and 5 uni-modal retrieval experts (text retrieval, face recognition, image-based retrieval based on color, texture and edge histograms) [Yan et al., 2004]. For the transfer learning algorithms, the covariance matrix Ψ is initialized to an identity matrix. The parameters of γ h and λh are initialized to random values uniformly sampled from [0, 1]. The EM algorithm stops when the relative change of the log-likelihood is less than 10−5 . We also designed the following 10 binary task features for the TL-TF model, which indicate if the query topics contain (1) specific person names; (2) specific object names; (3) more than two noun phrases; (4) words related to people/crowd; (5) words related to sports; (6) words related to vehicle; (7) words related to motion; (8) similar image examples; (9) image examples with faces; (10) if the text retrieval module finds more than 100 documents. All these task features can be automatically detected from the query description through manually defined rules plus natural language processing and image processing techniques [Yan et al., 2004]. 5.2 Retrieval Results We present the information retrieval results on all the testing sets using the proposed models and parameters learned from the development collection. The number of task clusters can be estimated by optimizing the regularized log-likelihood with an additional BIC term. Based on development data, we estimated that the optimal number of task mixtures is 6. Table 2 provided a detailed comparison between TL-TF with 6 query clusters (TL-TF, H=6) and several baseline algorithms including single task learning using logistic regression (STL), single-cluster TL (TL-SC, H=1), multi-cluster TL with 6 clusters(TL-MC, H=6), and TL-Comb with 6 clusters (TL-Comb, H=6)3 . All the results are reported in terms of the mean average precisions (Mean-APs) up to 1000 documents, precision at top 30 and 100 documents. We also grouped the queries in each collection and reported their Mean-APs in five different categories, i.e., named person, special object, general object, sports and general queries. We can observe from the results that TL-MC is usually superior to the STL and TL-SC algorithms, because user queries are typically organized into a limited number of clusters in the query feature space. However without task features, TL-MC needs to assume all new queries share the same parameter distribution, which limits the flexibility of transfer learning to produce more accurate results. TL-Comb, which incorporates task features as additional data features, offers slight gains over TL-MC, but the improvement is limited because it cannot dynamically change the clustering assignment when the queries vary. In contrast to TL-MC, TL-TF is able to leverage more powers from TL by routing new queries to the correct task clusters. On average, it brings a roughly 4% absolute improvement (or 30% relative improvement) over STL in terms of mean average precision. As it results, the difference between TL-TF and STL becomes statistically significant in two out of four collections. By comparing Mean-APs with respect to each query type, we find that TL-TF benefits most from the Person and Special Object type queries, as well as the Sports type in t05. This is because these query types have 3 All our baseline results are among the top performance in each year’s TRECVID evaluation 1319 0.12 0.11 0.115 Mean Average Precision Mean Average Precision 0.115 0.105 0.1 0.095 0.09 0.11 0.105 0.1 0.095 MTL−TF MTL−MC 0.085 1 2 3 4 5 6 7 8 9 MTL−TF MTL−MC 0.09 1 10 # of Task Mixtures 2 3 4 5 6 7 8 9 10 # of Task Mixtures Figure 1: Comparison of TL-TF and TL-MC vs. the number of task clusters on two datasets. Left: t04, Right: t05 their information needs clearly defined and thus they are able to be improved by making better use of the training data. To further analyze the performance of the proposed approaches, Figure 1 compares the TL-TF and TL-MC learning curves on two testing sets with the number of query clusters growing from 1 to 10. As we can see, TL-TF produces noticeably better results than TL-MC when the task clusters is larger than 2. The mean average precision of TL-TF increase dramatically using more task clusters especially when the cluster number is under four. For example, in t05 the mean average precision is boosted from 9.5% with one cluster to 12% with four clusters. This clearly shows the benefits of using task features to model complex task distributions compared with its non-task-feature counterpart. 6 Conclusion In this paper, we propose a probabilistic transfer learning model called TL-TF, of which the basic idea is to use tasklevel features to control the task mixture selection in a hierarchical Bayesian model. In this model, each task predictor is modeled as a linear combination of a set of basic classifiers, and the task relatedness is explained by some hidden sources. We use task features to determine the distribution parameters of these hidden sources via a multi-class logistic regression model. As a result, each task predictor is able to choose the specific task mixture component more accurately based on the characteristics described by task features. To estimate the parameters of TL-TF, we also develop an empirical Bayes method based on variational approximation techniques. Experimental results on several information retrieval collections show that the proposed model is much more effective than the other transfer learning methods that either do not utilize task features, or simply treat them as data features. References [Ando and Zhang, 2004] R. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Technical Report RC23462, IBM Research, 45, 2004. [Bonilla et al., 2007] E. V. Bonilla, F. V. Agakov, and C. K. I. Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th AISTATS, 2007. [Breiman and Friedman, 1997] L. Breiman and J.H. Friedman. Predicting multivariate responses in multiple linear regression. J. Royal. Statist. Soc B., 59(1):3–54, 1997. [Caruana, 1997] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. [Evgeniou et al., 2005] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005. [Hastie et al., 2001] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, first edition, 2001. [Heskes, 2000] Tom Heskes. Empirical bayes for learning to learn. In Proc. 17th International Conf. on Machine Learning, pages 367–374. Morgan Kaufmann, San Francisco, CA, 2000. [Jaakkola and Jordan, 1997] T. Jaakkola and M. Jordan. A variational approach to bayesian logistic regression models and their extensions. In Proceedings of 6th International Worksop on AI and Statistics, 1997. [Smeaton and Over, 2003] A.F. Smeaton and P. Over. TRECVID: Benchmarking the effectiveness of information retrieval tasks on digital video. In Proc. of the Intl. Conf. on Image and Video Retrieval, 2003. [Thrun and Pratt, 1998] S. Thrun and L. Pratt. Learning to Learn. Kluwer Academic Publishers, 1998. [Xue et al., 2007] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with dirichlet process priors. J. Mach. Learn. Res., 8:35–63, 2007. [Yan et al., 2004] R. Yan, J. Yang, and A. G. Hauptmann. Learning query-class dependent weights in automatic video retrieval. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 548–555, 2004. [Yu et al., 2005] K. Yu, V. Tresp, and A. Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of 22nd International Conference on Machine Learning (ICML), 2005. [Zhang et al., 2005] J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In Neural Information Processing Systems 18, 2005. [Bakker and Heskes, 2003] B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. J. Mach. Learn. Res., 4:83–99, 2003. 1320