Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM 2016) Dynamic Data Capture from Social Media Streams: A Contextual Bandit Approach Thibault Gisselbrecht+∗ and Sylvain Lamprier∗ and Patrick Gallinari∗ + Technological Research Institute SystemX, 8 Avenue de la Vauve, 91120 Palaiseau, France thibault.gisselbrecht@irt-systemx.fr ∗ Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, 4 place Jussieu, 75005 Paris firstname.lastname@lip6.fr nishes access to the data published by a particular user, under restrictive constraints related to the number of streams that can be simultaneously considered. On Twitter for instance, data capture is limited to 5000 simultaneous streams1 . Regarding the huge number of available users on this media, it requires an incredibly important effort for targeting relevant sources. Selecting relevant users to follow among the whole set of accounts is very challenging. Imagine someone interested in politics wishing to keep track of people in relation to this topic in order to capture the data they produce. One solution is to manually select a bucket of accounts and follow their activity along time. However, Twitter is known to be extremely dynamic and for any reason some users might start posting on this topic while others might stop at any time. So if the interested user wants to be up to date, he might change the subset of followed accounts dynamically, his goal being to anticipate which users are the most likely to produce relevant content in a close future. This task seems hard to handle manually for two major reasons. First, the criteria chosen to predict whether an account will potentially be interesting might be hard to define. Secondly, even if one would be able to manually define those criteria, the amount of data to analyze would be too large. Regarding these important issues, we propose a solution that, at each iteration of the process, automatically selects a subset of accounts that are likely to be relevant in the next time window, depending on their current activity. Those accounts are then followed during a certain time and the corresponding published contents are evaluated to quantify their relevance. The algorithm behind the system then learns a policy to improve the selection at the next time step. We tackle this task as a contextual bandit problem, in which at each round, a learner chooses an action among a bigger set of available ones, based on the observation of action features - also called context - and then receives a reward that quantifies the quality of the chosen action. In our case, considering that following someone corresponds to an action, several actions have to be chosen at each time step, since we wish to leverage the whole capture capacity allowed by the streaming API and therefore collect data from several simultaneous streams. Abstract Social Media usually provide streaming data access that enable dynamic capture of the social activity of their users. Leveraging such APIs for collecting social data that satisfy a given pre-defined need may constitute a complex task, that implies careful stream selections. With user-centered streams, it indeed comes down to the problem of choosing which users to follow in order to maximize the utility of the collected data w.r.t. the need. On large social media, this represents a very challenging task due to the huge number of potential targets and restricted access to the data. 
Because of the intrinsic nonstationarity of user’s behavior, a relevant target today might be irrelevant tomorrow, which represents a major difficulty to apprehend. In this paper, we propose a new approach that anticipates which profiles are likely to publish relevant contents - given a predefined need - in the future, and dynamically selects a subset of accounts to follow at each iteration. Our method has the advantage to take into account both API restrictions and the dynamics of users’ behaviors. We formalize the task as a contextual bandit problem with multiple actions selection. We finally conduct experiments on Twitter, which demonstrate the empirical effectiveness of our approach in real-world settings. 1 Introduction In recent years, many social websites that allow users to publish and share content online have appeared. For example Twitter, with 302 million active users and more than 500 millions posts every day, is one of the main actors of the market. These social media have become a very important source of data for many applications. Given a pre-defined need, two solutions are usually available for collecting useful data from such media: 1) getting access to huge repositories of historical data or 2) leveraging streaming services that most media propose to enable real-time tracking of their users’ activity. As the former solution usually implies very important costs for retrieving relevant data from big data warehouses, the latter may constitute a very relevant alternative that enables real-time access to focused data. However, it implies to be able to efficiently target relevant streams of data. In this paper, we consider the case of user-centered streams of social data, i.e. where each individual stream fur- 1 In this paper we focus on Twitter, but the approach is valid for many other social platforms. Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 131 With the current activity of a certain user corresponding to its context features, our task suits well the contextual bandit framework. However, in its traditional instance, features of every action are required to be observed at each iteration to perform successive choices. For our specific task, we are not able to observe the current activity of every potential user. Consequently, we are not able to get everyone’s context and directly use classical contextual bandit algorithms such as LinUCB (Chu et al. 2011). On the other hand, it cannot be treated as a traditional (non-contextual) bandit problem since we would lose the useful information provided by observed context features. To the best of our knowledge, such an hybrid instance of bandit problem has not been studied yet. To solve our problem, we propose a data capture algorithm which, along with learning a selection policy of the k best streams at each round, learns on the feature distributions themselves, so as to be able to make approximations when features are hidden from the agent. The paper is organized as follows: Section 2 presents some related work. Section 3 formalizes the task and presents our data capture system. Section 4 then describes the model and the algorithm proposed to solve the task. Finally, section 5 reports various sets of experimental results. 2 from their posterior distributions, have been designed (Kaufmann, Korda, and Munos 2012; Agrawal and Goyal 2012a; Chapelle and Li 2011; Agrawal and Goyal 2012b). 
More recently, the case where the learner can play several actions simultaneously has been formalized respectively in (Chen, Wang, and Yuan 2013) (CUCB algorithm) and (Qin, Chen, and Zhu 2014) (C2 UCB algorithm) for the non contextual and the contextual case. Finally in (Gisselbrecht et al. 2015) the authors propose the CUCBV algorithm which extends the original UCBV of (Audibert, Munos, and Szepesvari 2007) to the multiple plays case. To the best of our knowledge, no algorithm exists for our case, i.e. when feature vectors are only observable with some given probability. Regarding social media, several existing tasks present similarities with ours. In (Li, Wang, and Chang 2013), the authors build a platform called ATM that is aimed at automatically monitoring target tweets from the Twitter Sample stream for any given topic. In their work, they develop a keyword selection algorithm that efficiently selects keywords to cover target tweets. In our case, we do not focus on keywords but on user profiles by modeling their past activity on the social media. In (Colbaugh and Glass 2011) the authors model the blogosphere as a network in which every node is considered as a stream that can be followed. Their goal is to identify blogs that contain relevant contents in order to track emerging topics. However, their approach is static and their model is trained on previously collected data. Consequently, this approach could not be applied to our task: first we cannot have access to the whole network due to APIs restrictions and secondly, the static aspect of their methods is not suitable to model the dynamics of users’ behaviors that can change very quickly. In (Hannon, Bennett, and Smyth 2010) and (Gupta et al. 2013), the authors build recommendation systems, respectively called Twittomender and Who to Follow. In the former collaborative filtering methods are used to find twitter accounts that are likely to interest a target user. In the latter, a circle of trust (which is the result of an egocentric random walk similar to personalized PageRank (Fogaras et al. 2005)) approach is adopted. In both cases, authors assume the knowledge of the whole followers/followees Twitter graph, which is not feasible for large scale applications due to Twitter restriction policies. Moreover, the task is quite different from ours. While these two models are concerned about the precision of the messages collected from the selected followees (in order to avoid relevant information to be drown in a too large amount of data), we rather seek at maximizing the amount of relevant data collected in a perspective of automatic data capture. Bandit algorithms have already been applied to various tasks related to social networks. For example in (Kohli, Salek, and Stoddard 2013), the authors handle abandonment minimization tasks in recommendation systems with bandits. In (Buccapatnam, Eryilmaz, and Shroff 2014), bandits are used for online advertising while in (Lage et al. 2013), the authors use contextual bandit for an audience maximization task. In (Gisselbrecht et al. 2015), a data capture task is tackled via non-contextual bandit algorithms. In that case, each user is assumed to own a stationary distribution and the process is supposed to find users with best means by trading Related Work In this section we first present some literature related to bandit learning problems, then we discuss about some data capture related tasks and finally present some applications of bandits for social media tasks. 
The multi-armed bandit learning problem, originally studied in (Lai and Robbins 1985) in its stationary form has been widely investigated in the literature. In this first instance, the agent has no access to side information on actions and assume stationary reward distributions. A huge variety of methods have been proposed to design efficient selection policies, with corresponding theoretical guarantees. The famous Upper Confidence Bound (UCB) algorithm proposed in (Auer, Cesa-Bianchi, and Fischer 2002) and other UCBbased algorithms ((Audibert, Munos, and Szepesvari 2007; Audibert and Bubeck 2009)) have already proven to solve the so-called stochastic bandit problem. This type of strategies keeps an estimate of the confidence interval related to each reward distribution and plays the action with highest upper confidence bound at each time step. In (Auer, CesaBianchi, and Fischer 2002), the authors introduce the contextual bandit problem, where the learner observes some features for every action before choosing the one to play, and propose the LinRel algorithm. Those features are used to better predict the expected rewards related to each action. More recently, the LinUCB algorithm, which improves the performance of LinRel has been formalized (Chu et al. 2011). Those two algorithms assume the expected reward of an action to be linear with respect to some unknown parameters of the problem to be estimated. In (Kaufmann, Cappe, and Garivier 2012), the authors propose the Bayes-UCB algorithm which unifies several variants of UCB algorithms. For both the stochastic and contextual bandit case, Thompson sampling algorithms, which introduce randomness on the exploration by sampling action parameters 132 off between the exploitation of already known good users to follow and the exploration of unknown ones. The proposed approach mainly differs from this work for the nonstationnarity assumptions it relies on. This work is considered in the experiments to highlight the performances of the proposed dynamic approach. 3 3.1 the set of user accounts that maximizes the sum of collected relevance scores: T max ra,t (1) (Kt )t=1..T Data Streams Selection Context As we described in the introduction, our goal is to propose a system aimed at capturing some relevant data from streaming APIs proposed by most of social media. Face to such APIs, two different types of data sources, users or keywords (or a mixture of both kinds), can be investigated. In this paper, we focus on users-centered streams. Users being the producers of the data, following them enables more targeted data capture than keywords. Following keywords to obtain data about a given topic for example would be likely to lead to the collection of a greatly too huge amount of data2 , which would imply very important post-filtering costs. Moreover, considering user-centered streams allows a larger variety of applications, with author-related aspects, and more complex topical models, than focusing on keywords-centered streams. Given that APIs usually limit the ability to simultaneously follow users’ activity to a restricted number of accounts, the aim is to dynamically select users that are likely to publish relevant content with regard to a predefined need. Note that the proposed approach would also be useful in the absence of such restriction, due to the tremendous amount of data that users publish on main social media. 
The major difficulty is then to efficiently select user accounts to follow given the huge amount of users that post content and that at the beginning of the process, no prior information is known about potential sources of data. Moreover, even if some specific accounts to follow could be found manually, adapting the capture to the dynamics of the media appears intractable by hand. For all these reasons, building an automatic solution to orient the user’s choices appears more than useful. To make things concrete, in the rest of the paper we set up in the case of Twitter. However, the proposed generic approach remains available for any social media providing real-time access to its users’ activity. 3.2 t=1 a∈Kt Note that, in our task, we are thus only focused on getting the maximal amount of relevant data, the precision of the retrieved data is not of our matter here. This greatly differs from usual tasks in information retrieval and followees recommendation for instances. Relevance scores considered in our data capture process depend on the information need of the user of the system. The data need in question can take various forms. For example, one might want to follow the activity of users that are active on a predefined topic or influent in the sense that their messages are reposted a lot. They can depend on a predefined function that automatically assigns a score according to the matching of the content to some requirements (see section 5 for details). Whereas (Gisselbrecht et al. 2015) relies on stationarity assumptions on the relevance score distributions of users for their task of focused data capture (i.e., relevance scores of each user account are assumed to be distributed around a given stationary mean along the whole process), we claim that this is a few realistic setting in the case of large social networks and that it is possible to better predict future relevance scores of users according to their current activity. In other words, with the activity of a user a ∈ K at iteration t − 1 represented by a d-dimensional real vector za,t , there exists a function h : Rd → R that explains from za,t the relevance score ra,t that a user a would obtain if it was followed during iteration t. This correlation function needs to be learned conjointly with the iterative selection process. However, in our case, maximizing the relevance scores as defined in formula 1 is constrained in some different ways: • Relevance scores ra,t are only defined for followed accounts during iteration t (i.e. for every a ∈ Kt ), others are unknown; • Context vectors za,t are only observed for a subset of users Ot (i.e., it is not possible to observe the whole activity of the social network). Constraints on context vectors are due to API restrictions. Two streaming APIs of Twitter are used to collect data: • On the one hand, a Sample streaming API furnishes realtime access to 1% of all public tweets. We leverage this API to discover unknown users and to get an important amount of contexts vectors for some active users; • On the other hand, a Follow streaming API provides realtime data published by a subset of the whole set of Twitter users. This API allows the system to specify a maximal number of 5000 users to follow. We leverage this API to capture focused data from selected users. For users in Ot , their relevance score at next iteration can be estimated via the correlation function h, which allows to capture variations of users’ usefulness. 
For others, however, we propose to consider the stationary case, by assuming some general tendency in users' usefulness. We therefore get a hybrid problem, where observed contexts can be used to explain variations from estimated utility means.

3.2 A Constrained Decision Process

As described above, our problem comes down to selecting, at each iteration t of the process, a subset Kt of k user accounts to follow among the whole set of possible users K (Kt ⊆ K), according to their likelihood of posting relevant tweets for the formulated information need. Given a relevance score ra,t assigned to the content posted by user a ∈ Kt during iteration t of the process (the set of tweets posted during iteration t), the aim is then to select, at each iteration t over T,

² Note that on some social media such as Twitter, a limitation of 1% of the total amount of messages produced on the network is set. Considering streams centered on popular keywords usually leads to exceeding this limitation, and then to ignoring some important messages.

3.3 System Description

[Figure 1 — system diagram: at each iteration, a selection policy and a learner choose the followed set Kt from the pool of potential users; tweets collected from the Follow API produce relevance feedback, while the Sample Streaming API provides context observations za,t and the observed set Ot, across iterations t−1, t and t+1.]

At each round t ∈ {1, .., T} of the process, feature vectors za,t ∈ R^d can be associated with the users a ∈ K. The set of k users selected to be followed during iteration t is denoted by Kt. The reward obtained by following a given user a during iteration t of the process is denoted by ra,t. Note that only the rewards ra,t of users a ∈ Kt are known; the others cannot be observed. In the field of contextual bandit problems, the usual linear hypothesis assumes the existence of an unknown vector β ∈ R^d such that ∀t ∈ {1, .., T}, ∀a ∈ K: E[ra,t | za,t] = za,t^T β (see (Agrawal and Goyal 2012b)). Here, in order to model the intrinsic quality of every available action (where an action corresponds in our case to the selection of a user to follow during the current iteration), we also suppose the existence of a bias term θa ∈ R for every action a, such that finally:

∀t ∈ {1, .., T}, ∀a ∈ K: E[ra,t | za,t] = za,t^T β + θa    (2)

This formulation corresponds to a particular case of the LinUCB algorithm with the hybrid linear reward proposed in (Li et al. 2010), obtained by taking a constant arm-specific feature equal to 1 at every round. Note that individual parameters could also have been considered in our case, but this is not well suited to the types of problems of concern in this paper, where the number of available actions at each round is usually high (which would imply a difficult learning of individual parameters). For the sake of simplicity, we therefore restrict this individual modeling to the bias term.

In our case, a major difference with existing work on contextual bandit problems is that not every context is accessible to the selection policy before it chooses the users to follow. Here, we rather assume that every user a of the social media has a probability pa (0 < pa < 1) of revealing its context³. The set of users whose features the process observes at time t is denoted Ot, and its complement is denoted Ōt. Then, since Kt ⊂ K and K = Ot ∪ Ōt, selected users can belong to the set of users without observed context Ōt.
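To make this decision process concrete, the following minimal simulation sketch (purely illustrative and not taken from the paper; the dimension, pool size and probabilities are assumed values) draws rewards around za,t^T β + θa and reveals each user's context only with its own probability pa, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_users = 30, 1000                     # illustrative feature dimension and pool size
beta = rng.normal(size=d)                 # shared parameter beta (unknown to the learner)
theta = rng.normal(size=n_users)          # per-user bias terms theta_a
p_obs = rng.uniform(0.05, 0.5, n_users)   # probability p_a that user a reveals its context

def simulate_round():
    """One iteration: contexts, partial observability and noisy relevance scores."""
    z = rng.normal(size=(n_users, d))                       # z_{a,t}: current activity features
    observed = rng.random(n_users) < p_obs                  # O_t: users whose context is revealed
    rewards = z @ beta + theta + rng.normal(size=n_users)   # r_{a,t} around E[r | z] = z^T beta + theta_a
    return z, observed, rewards

z, observed, rewards = simulate_round()
# A selection policy would only see z[a] for users in O_t, and rewards[a]
# for the k users it decides to follow during the iteration.
```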
Finally, as told above, while our approach would remain valid in a classical bandit setting where only one action is chosen at each step, our task requires to consider the case where multiple users can be selected to be followed at each iteration (since we wish to exploit the whole capture capacity allowed by the API). At each round, the agent has then to choose k (k < K) users among the K available ones, according to observed features and individual knowledge about them. The reward obtained after a period of capture then corresponds to the sum of individual rewards collected from the k users followed during this period. Figure 1: System illustration. Figure 1 depicts our system of focused data capture from social media. The process follows the same steps at each iteration, three time steps are represented on the figure. At the beginning of each iteration, the selection policy selects a set of users to follow (Kt ) among the pool of known users K, according to observations and some knowledge provided by a learner module. Then, the messages posted by the k selected users are collected via the Follow streaming API. As we can see in the central part of the illustration, after having followed users in Kt during the current iteration t, the collected messages are analyzed (by a human or an automatic classifier) to give feedbacks (i.e., relevance scores) to the learner. Meanwhile, as a parallel task during each iteration, current activity features are captured from the Sample streaming API. This allows us to feed the pool of potential users to follow K and to build the set Ot of users with observed contexts. Context vectors for iteration t are built by considering all messages that the system collected from the Follow streaming and the Sample streaming APIs for each user during this iteration. Therefore, every user whose at least one message was included in these collected messages at iteration t is included in Ot+1 (messages from a same author are concatenated to form za,t+1 , see section 5 for an example of context construction from messages). Contexts vectors collected during iteration t serve as input for the selection strategy at iteration t + 1. 4 Model and Algorithm 4.2 This section first introduces some settings and backgrounds for our contextual bandit approach for targeted data capture, details the policy proposed to efficiently select useful users to follow, and then describes the general algorithm of our system. 4.1 (2) Distribution assumptions To derive our algorithm, we first perform a maximum a posteriori estimate for the case where every context are available, based on the following assumptions: • Likelihood: reward scores are identically and independently distributed w.r.t. observed contexts: ra,t ∼ Problem setting As defined above, we denote by K the set of K known users that can be follow at each iteration of the data capture pro- 3 The case pa = 1 ∀a ∈ K corresponds to the traditional contextual bandit. 
N(za,t^T β + θa, λa), with λa the variance of the difference between the reward ra,t and the linear term za,t^T β + θa;

• Prior: the unknown parameters are normally distributed: β ∼ N(0, b Id) and θa ∼ N(0, ρa), where b and ρa are two values allowing to control the variance of the parameters and Id is the identity matrix of size d (the size of the feature vectors).

For the sake of clarity, we set b, ρa and λa equal to 1 for all a in the following. Note that all results can be extended to more complex cases.

Proposition 1. Denoting Ta the set of steps at which user a has been chosen in the first n time steps of the process (Ta = {t ≤ n, a ∈ Kt}, |Ta| = τa), ca the vector containing the rewards obtained by a at the iterations where it has been followed (ca = (ra,t)_{t∈Ta}) and Da the context matrix related to user a (Da = (za,t^T)_{t∈Ta}), the posterior distribution of the unknown parameters after n time steps, when all contexts are available, follows:

β ∼ N(β̄, A0^{-1})    (3)
θa + z̄a^T β ∼ N(μ̄a, 1/(τa + 1))    (4)

with:
A0 = Id + Σ_{a=1}^{K} (τa + 1) Σ̄a,    b0^T = Σ_{a=1}^{K} (τa + 1) ξ̄a,    β̄ = A0^{-1} b0,
Σ̄a = Da^T Da / (τa + 1) − z̄a z̄a^T,    ξ̄a = ca^T Da / (τa + 1) − μ̄a z̄a^T,
μ̄a = Σ_{t∈Ta} ra,t / (τa + 1),    z̄a = Σ_{t∈Ta} za,t / (τa + 1).

Proof 1. The full derivation is available in the supplementary material at http://www-connex.lip6.fr/~lampriers/ICWSM2016supplementaryMaterial.pdf. It proceeds as follows: denoting D = ((ra1,1, za1,1), ..., (ran,n, zan,n)) and using Bayes' rule we have:
p(β, θ1, ..., θK | D) ∝ p(D | β, θ1, ..., θK) p(β) Π_{a=1}^{K} p(θa)
∝ exp( −(1/2) [ Σ_{t=1}^{n} (rat,t − zat,t^T β − θat)² / λat + β^T β / b + Σ_{a=1}^{K} θa² / ρa ] ).
Rearranging the terms and using matrix notations leads to:
p(β, θ1, ..., θK | D) ∝ exp( −(1/2) [ β^T A0 β − 2 b0^T β + Σ_{a=1}^{K} (τa + 1)(θa + z̄a^T β − μ̄a)² ] ).
The two distributions are directly derived from this formula.

None of the parameters above needs to be recomputed from scratch: they can all be updated efficiently as new learning examples arrive (see Algorithm 1).

Theorem 1. For any 0 < δ < 1 and za,t ∈ R^d, denoting α = √2 erf^{-1}(1 − δ), where erf^{-1} is the inverse of the error function erf(x) = (2/√π) ∫_0^x e^{−t²} dt, for every action a after t iterations:

P( |E[ra,t | za,t] − μ̄a − (za,t − z̄a)^T β̄| ≤ α σa ) ≥ 1 − δ    (5)
with σa = sqrt( 1/(τa + 1) + (za,t − z̄a)^T A0^{-1} (za,t − z̄a) ).

Proof 2. E[ra,t | za,t] = (za,t − z̄a)^T β + θa + z̄a^T β. By combining equations 2, 3 and 4: E[ra,t | za,t] ∼ N( μ̄a + (za,t − z̄a)^T β̄, 1/(τa + 1) + (za,t − z̄a)^T A0^{-1} (za,t − z̄a) ). The announced result comes from the confidence interval of a Gaussian.

This formula is directly used to define the so-called Upper Confidence Bound, which leads to choosing the k users having the highest score values sa,t at round t, with:

sa,t = μ̄a + (za,t − z̄a)^T β̄ + α σa    (6)

We recall that this formula can be used in the traditional contextual bandit problem, where every context is available. In the next section, we propose a method to adapt it to our specific setting of data capture, where most contexts are usually hidden from the agent.
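Before moving to that setting, here is a short Python sketch of the fully observed case (an illustrative transcription under our own naming, not the authors' implementation); it maintains the statistics of Proposition 1 incrementally, mirroring the updates Algorithm 1 performs when a followed user's context was observed, and computes the score of equation (6).

```python
import numpy as np

class ObservedContextScorer:
    """Sufficient statistics of Proposition 1 and the UCB score of equation (6)."""

    def __init__(self, d, alpha=1.96):
        self.A0 = np.eye(d)      # A_0, initialized to the identity I_d
        self.b0 = np.zeros(d)    # b_0
        self.alpha = alpha       # exploration parameter
        self.users = {}          # a -> {tau_a, s_a (sum of rewards), b_a (sum of contexts)}

    def user_stats(self, a, d):
        return self.users.setdefault(a, {"tau": 0, "s": 0.0, "b": np.zeros(d)})

    def update(self, a, z, r):
        """Incorporate one (context, reward) pair for a user that was followed and observed."""
        u = self.user_stats(a, z.shape[0])
        # remove the rank-one terms computed with the old counts of user a ...
        self.A0 += np.outer(u["b"], u["b"]) / (u["tau"] + 1)
        self.b0 += u["b"] * u["s"] / (u["tau"] + 1)
        # ... then add the new observation and re-insert the updated terms
        u["tau"] += 1
        u["s"] += r
        u["b"] += z
        self.A0 += np.outer(z, z) - np.outer(u["b"], u["b"]) / (u["tau"] + 1)
        self.b0 += r * z - u["b"] * u["s"] / (u["tau"] + 1)

    def score_observed(self, a, z):
        """Selection score of equation (6) for a user whose current context z is observed."""
        u = self.user_stats(a, z.shape[0])
        A0_inv = np.linalg.inv(self.A0)
        beta_bar = A0_inv @ self.b0
        mu_bar = u["s"] / (u["tau"] + 1)
        z_bar = u["b"] / (u["tau"] + 1)
        dz = z - z_bar
        sigma = np.sqrt(1.0 / (u["tau"] + 1) + dz @ A0_inv @ dz)
        return mu_bar + dz @ beta_bar + self.alpha * sigma
```

At each iteration, the k users with the highest scores would be followed, and update() would be called only for those of them whose context had actually been observed.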
4.3 Dealing with Hidden Contexts

The selection scores derived in the previous section are defined for cases where all context vectors are available to the agent at each round. In our case however, only users belonging to the subset Ot show their context. This particular setting requires deciding how to judge users whose context is unknown. Moreover, it also raises questions about knowledge updates when some users are selected without having been (contextually) observed. Even though it is tempting to think that the common parameter β and the user-specific ones θa can be learned independently, they are in reality complexly correlated, as formulas 3 and 4 highlight. It is important to emphasize that, to keep probabilistic guarantees, parameters should only be updated when a chosen user was also observed, i.e. when it belongs to Kt ∩ Ot. So, replacing the previous Ta by Ta^both = {t ≤ n, a ∈ Kt ∩ Ot} (we keep the notation |Ta^both| = τa) allows us to re-use the update formulas of Proposition 1. However, computing the selection score of a user whose context is hidden from the agent cannot be done via equation 6, since za,t is unknown. To deal with such a case, we propose to use an estimate of the mean of the feature-vector distribution of each user. The assumption is that, while non-stationarity can be captured when contexts are available, different users own different mean reward distributions, and that it can be useful to identify globally useful users.

New notations:
• We denote the set of steps at which user a revealed its context vector after n steps by Ta^obs = {t ≤ n, a ∈ Ot}, with |Ta^obs| = na.
• The empirical mean of the feature vector of user a is denoted ẑa, with ẑa = (1/na) Σ_{t∈Ta^obs} za,t. Note that ẑa is different from z̄a, since the former is updated every time the context za,t is observed while the latter is only updated when user a is both observed and played in the same iteration.

Main assumptions:
• Without loss of generality, assume that β is bounded by M ∈ R+*, i.e. ||β|| ≤ M, where || || stands for the L2 norm on R^d.
• For any user a, at any time t, the context vectors za,t ∈ R^d are sampled i.i.d. from an unknown distribution with finite mean vector E[za] and covariance matrix Σa (both unknown).
Note that the requirement ||β|| ≤ M can be satisfied through proper rescaling of β.

Proposition 2. The expected value of the mean reward of user a with respect to its feature distribution, denoted E[ra], satisfies after t iterations of the process:

E[ra] = E[za,t^T β + θa] = E[za]^T β + θa = (ẑa − z̄a)^T β + θa + z̄a^T β + (E[za] − ẑa)^T β    (7)

Theorem 2. Given 0 < δ < 1 and 0 < γ < 1/2, denoting α = √2 erf^{-1}(1 − δ), for every user a after t iterations of the process:

P( |E[ra] − μ̄a − (ẑa − z̄a)^T β̄| ≤ α σ̂a + 1/na^γ ) ≥ (1 − δ)(1 − Ca / na^{1−2γ})    (8)
with σ̂a = sqrt( 1/(τa + 1) + (ẑa − z̄a)^T A0^{-1} (ẑa − z̄a) ),

where Ca is a positive constant specific to each user.

Proof 3. The full proof is available in the same supplementary material; we only give the main steps here. Denoting m̂a = (ẑa − z̄a)^T β̄ + μ̄a, σ̂a² = 1/(τa + 1) + (ẑa − z̄a)^T A0^{-1} (ẑa − z̄a), Xa the random variable such that Xa = (ẑa − z̄a)^T β + θa + z̄a^T β, and using the Cauchy-Schwarz inequality, we have:
|E[ra] − m̂a| / σ̂a ≤ |Xa − m̂a| / σ̂a + ||E[za] − ẑa|| M / σ̂a.
Then, using the Gaussian property of Xa, we show that P( |Xa − m̂a| / σ̂a ≤ α ) = 1 − δ, with α = √2 erf^{-1}(1 − δ). On the other hand, with Chebyshev's inequality, we prove that:
P( ||E[za] − ẑa|| M / σ̂a ≤ 1/na^γ ) ≥ 1 − ( M² d / (σ̂a² na^{1−2γ}) ) Trace(Σa).
Finally, combining the two previous results proves the theorem.

For γ < 1/2, the previous probability tends to 1 − δ as the number of observations of user a increases (as in equation 5). Then, the inequality above gives a reasonably tight UCB for the expected payoff of user a, from which a UCB-type user-selection strategy can be derived. At each trial t, if the context of user a has not been observed (i.e. a ∉ Ot), set:

sa,t = μ̄a + (ẑa − z̄a)^T β̄ + α σ̂a + 1/na^γ    (9)

Given that γ > 0, and that each user a has a probability 0 < pa < 1 of revealing its context at each time step t, the extra-exploration term 1/na^γ tends to 0 as the number of observations of user a grows. The above score then tends to a classical LinUCB score (as in equation 6) in which the context vector za,t is replaced by its empirical mean ẑa as t increases.
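As a companion to the previous sketch, the score of equation (9) for a user whose current context is hidden can be computed from the running empirical mean ẑa and the observation count na (both assumed to be maintained every time a context of a is observed). This is an illustrative reading of the formula as written above, not the authors' code.

```python
import numpy as np

def score_hidden(scorer, a, z_hat, n_a, gamma=0.25):
    """Score of equation (9): the empirical mean context z_hat replaces z_{a,t},
    and the extra exploration term 1/n_a^gamma widens the bound while few
    contexts of user a have been observed."""
    u = scorer.user_stats(a, z_hat.shape[0])
    A0_inv = np.linalg.inv(scorer.A0)
    beta_bar = A0_inv @ scorer.b0
    mu_bar = u["s"] / (u["tau"] + 1)
    z_bar = u["b"] / (u["tau"] + 1)
    dz = z_hat - z_bar
    sigma_hat = np.sqrt(1.0 / (u["tau"] + 1) + dz @ A0_inv @ dz)
    return mu_bar + dz @ beta_bar + scorer.alpha * sigma_hat + 1.0 / (n_a ** gamma)
```

Users that were never scored before would instead receive an infinite score, as in Algorithm 1, so that each of them is tried at least once.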
4.4 Contextual Data Capture Algorithm

The pseudo-code of our contextual data capture algorithm from social streams is detailed in Algorithm 1. The algorithm starts by initializing the set of observed users Ot from the Sample API during a time period of L. Then, for each iteration t over T, it proceeds as follows:
1. Lines 10 to 15: every user from Ot that does not belong to the pool of potential users to follow is added to K, within the limit of maxNew new arms per iteration, to cope with cases where too many new arms are discovered in a single iteration (in that case, only the first maxNew new ones are considered, the others are simply ignored). This avoids over-exploration in such cases;
2. Lines 16 to 20: update of the empirical context means ẑa (according to the data collected from user a at the previous iteration) and of the observation counts na, for every observed user a in Ot;
3. Line 22: the parameter β̄ of the estimation model is updated according to the current counts;
4. Lines 23 to 37: selection scores are computed for every user in the pool K. As detailed in the previous section, selection scores are computed differently depending on the availability of the current context of the corresponding user. Note also that the selection scores of new users added to the pool during the current iteration are set to +∞, in order to force the system to choose them at least once and initialize their empirical mean reward;
5. Line 38: selection of the k users with the best selection scores;
6. Lines 39 to 43: simultaneous capture of data from the Sample and Follow APIs during a time period of L. The Follow API is focused on the subset of k selected users Kt;
7. Lines 44 to 55: computation of the rewards of the users in Kt according to the data collected from the Follow API. Then, using the users in Kt ∩ Ot, both the global model parameters (A0 and b0) and the arm-specific ones (τa, μ̄a and z̄a) are updated;
8. Line 57: update of the set of observed users, by gathering all users for whom at least one post has been collected from either API during the current iteration.

Algorithm 1: Contextual Data Capture Algorithm (pseudo-code; its numbered lines are the ones referenced in the steps above).

5 Experiments

Classical bandit algorithms usually come with an upper bound on the regret, which corresponds to the difference between the cumulative reward obtained with an optimal policy and with the bandit policy in question. In our case, an optimal strategy corresponds to a policy for which the agent would have a perfect knowledge of the parameters β and θa (for all a), plus access to all the features at each step. However, in the absence of an asymptotically exact estimate of za,t^T β + θa, due to the fact that some contexts za,t are hidden from the process, it is unfortunately not possible to show a sub-linear upper bound on the regret. Nevertheless, we claim that the algorithm we propose to solve our problem of contextual data capture from social media streams behaves well on average. To demonstrate this good behavior, we present various offline and online experiments in real-world scenarios.
Our algorithm is the first to tackle the case of bandits with partially hidden contexts, the aim is to demonstrate its feasibility for constrained tasks such as data capture from social media streams. Input: k, T , α, γ, L, maxN ew A0 ← Id×d (identity matrix of dimension d); b0 ← 0d (zero vector of dimension d); K ← ∅; Ω1 ← Data collected from Sample API during a period of L; O1 ← Authors of at least one post in Ω1 ; for t ← 1 to T do nbN ew ← 0; for a ∈ Ot do if a ∈ / K and nbN ew < maxN ew then nbN ew ← nbN ew + 1; K ← K ∪ {a}; τa ← 0; na ← 0; sa ← 0; ba ← 0d ; μ̄a ← 0; z̄a ← 0d ; ẑa ← 0d ; 15 16 17 18 19 20 21 22 23 24 25 26 end if a ∈ K then Observe context za,t from Ωt ∪ Ψt ; Update ẑa w.r.t. za,t ; na ← na + 1; end end else 1 τa +1 + (zˆa − z¯a )T A−1 0 (zˆa − z¯a ); 30 σ̂a ← 31 sa,t ← μ̄a + (zˆa − z¯a )T β̄ + α σ̂a + 32 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 1 γ na ; end end else sa,t ← ∞; end end Kt ← arg max K̂⊆K,|K̂|=k Experimental Setup Beyond a random policy that randomly selects the set of users Kt at each iteration, we compare our algorithm to two bandit policies well fitted for being applied for the task of data capture: CUCB and CUCBV respectively proposed in (Qin, Chen, and Zhu 2014) and (Gisselbrecht et al. 2015). These algorithms do not take features into account and perform stationary assumptions on reward distributions of users. Compared to CUCB, CUCBV adds the variance of reward distributions of users in the definition of the confidence intervals. This has been shown to behave well in cases such as our task of data capture, where a great variability can be observed due to the possible inactivity of users when they are selected by the process. At last, we consider a naive version of our contextual algorithm that preferably selects users with an observed context (it only considers users in Ot when this set contains enough users), rather than getting the ability of choosing users in Ōt as it is the case in our proposal. 
For all of the reported experiments we set: 1) The exploration parameter to α = 1.96, which corresponds to a 95 percent confidence interval on the estimate of the expected reward when contexts are observed; 2) The reduction parameter of confidence interval for unknown contexts to γ = 0.25, which allows a good trade-off between the confidence interval reduction rate and the probability of this interval; 3) The number of new users that can be added to the pool at each iteration to newM ax = 500, to avoid over-exploration, especially for online experiments, when many new users can be discovered from the Sample API; sa,t ← μ̄a + (za,t − z̄a )T β̄ + α σa ; 27 37 5.1 end β̄ ← A−1 0 b0 ; for a ∈ K do if τa > 0 then if a ∈ Ot then 1 T −1 σa ← τa +1 + (za,t − z̄a ) A0 (za,t − z̄a ); 28 29 33 34 35 36 Experiments sa,t ; a∈K̂ during a time interval of L do Ωt ← Collect data from the Sample API; Ψt ← Collect data from the Follow API focused on the users in Kt ; end for a ∈ Kt do Compute ra,t from Ψt ; if a ∈ Ot then A0 ← A 0 + b a b T a /(τa + 1); b0 ← b0 + ba sa /(τa + 1); τa ← τa + 1; sa ← sa + ra,t ; μ̄a ← sa /(τa + 1); ba ← ba + za,t ; z̄a ← ba /(τa + 1); T − b a bT A0 ← A0 + za,t za,t a /(τa + 1); b0 ← b0 + ra,t za,t − ba sa /(τa + 1) 5.2 Offline experiments Datasets In order to be able to test different policies and simulate a real time decision process several times, we first propose a set of experiments on offline datasets: end end Ωt+1 ← Ωt ; Ψt+1 ← Ψt ; Ot+1 ← Authors of at least one post in Ωt+1 ∪ Ψt+1 ; • USElections: dataset containing a total of 2148651 messages produced by 5000 users during the ten days preceding the US presidential elections in 2012. The 5000 cho- 58 end 137 sen accounts are the first ones who used either “Obama”, “Romney” or “#USElections”. Results For space reasons and brevity, we only report results for k = 100, L = 2min (which means that every 2 minutes the algorithms select 100 users to follow) for the USElections dataset and the four rewards defined above. For the Libya dataset, we only show the results for the politics reward. Similar tendencies were observed for different values of k and L. Figures 2 and 3 represent the evolution of cumulative reward for different policies and rewards, respectively for the USElections and the Libya dataset. For contextual algorithms, their naive versions that select observed users in priority are given with same symbols without solid line. First, it should be noticed that every policy performs better than the Random one, which is a first element to assert the relevance of bandit algorithms for the task in concern. Second, we notice that CUCBV performs better than CUCB, which confirms the results obtained in (Gisselbrecht et al. 2015) on a similar task of focused data capture. More interesting is the fact that when every context is observable, our contextual algorithm performs better than stationary approaches CUCB and CUCBV. This result shows that we are able to better anticipate which user is going to be the more relevant at the next time step, for a particular information need, given what he said right before. This also confirms the usual non-stationarity behavior of users. For instance, users can talk about science during the day at work while being more focused on sports when they come back home. Considering contexts also allows one to converge faster towards interesting users since every user account share the same β parameter. 
Results show that even for low probabilities of context observation p, our contextual policy behaves greatly better than non-contextual approaches, which empirically validates our approach: it is possible to leverage contextual information even if a large part of this information is hidden from the system. In particular, for the Libya dataset where no significant difference between the two non-contextual CUCB and CUCBV is noteworthy, our algorithm seems much more appropriate. Moreover, for a fixed probability p, the naive versions of the contextual algorithm offer lower performances than the original ones. By giving the ability of selecting users even if their context is unknown, we allow the algorithm to complete its selection by choosing users whose average context corresponds to a learned profile. If no user in Ot seems currently relevant, the algorithm can then rely on users with good averaged intrinsic quality. For k = 100, the average number of selected users for which the context was observed at each time step is 43 for p = 0.1 and 58 for p = 0.5, which confirms that the algorithm does not always select users in Ot . • Libya: dataset containing 1211475 messages from 17341 users. It is the result of a three-months capture from the F ollow API using the keyword “Libya”. In the next paragraph, we describe both how we transform messages in feature vectors and which reward function we use to evaluate the quality of a message. Model definition Context model The content obtained by capturing data from a given user a at time step t is denoted ωa,t (if we get several messages for a given author, these messages are concatenated). Given a dictionary of size m, messages can be represented as m dimensional bag of word vectors. However the size m might cause the algorithm to be computationally inefficient since it requires the inversion of a matrix of size m at each iteration. In order to reduce the dimension of those features, we used a Latent Dirichlet Allocation method (Blei, Ng, and Jordan 2003), which aims at modeling each message as a mixture of topics. However, due to the short size of messages, that standard LDA may not work well on Twitter (Weng et al. 2010). To overcome this difficulty, we choose the approach proposed in (Hong and Davison 2010), which aggregates tweets of a same user in one document. We choose a number of d = 30 topics and learn the model on the whole corpus. Then, if we denote by F : Rm −→ Rd the function that, given a message returns its representation in the topic space, the features of user a at time t is za,t = F (ωa,t−1 ). Reward model We trained a SVM topic classifier on the 20 Newsgroups dataset in order to rate each content. For our experiments, we focus on 4 classes to test: politics, religion, sport, science. We propose to consider the reward associated to some content as the number of times it has been retweeted (re-posted on Tweeter) by other users if it belongs to the specified class according to our classifier, or 0 otherwise. Finally, if a user posted several messages during an iteration, his reward ra,t corresponds to the sum of the individual rewards obtained by the messages he posted during iteration t. This instance of reward function corresponds to a task of seeking to collect messages, related to some desired topic, that will have a strong impact on the network (note that we cannot directly get high-degree nodes in the user graph due to API restrictions). 
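As an illustration of this context and reward construction, the sketch below uses scikit-learn for the LDA step (the paper does not specify a particular implementation); user_documents, the retweet counts and the pre-trained topic classifier is_on_topic are hypothetical placeholders, not artifacts of the original system.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_topic_model(user_documents, n_topics=30):
    """Fit the LDA topic model on one aggregated document per author,
    as in (Hong and Davison 2010). user_documents: list of str (assumed given)."""
    vectorizer = CountVectorizer(max_features=20000, stop_words="english")
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(vectorizer.fit_transform(user_documents))
    return vectorizer, lda

def context_vector(vectorizer, lda, messages):
    """F(omega_{a,t-1}): concatenate the messages a user posted during the previous
    iteration and map them to a 30-dimensional topic mixture z_{a,t}."""
    return lda.transform(vectorizer.transform([" ".join(messages)]))[0]

def reward(messages, retweet_counts, is_on_topic):
    """r_{a,t}: total number of retweets of the messages that the pre-trained
    classifier assigns to the target class; off-topic messages contribute 0."""
    return sum(rt for msg, rt in zip(messages, retweet_counts) if is_on_topic(msg))
```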
Experimental Settings. In order to obtain generalizable results, we test different values for the parameters p, the probability for a context to be observed, k, the number of users that can be followed simultaneously, and L, the duration of an iteration. More precisely, we experimented with every possible combination of p ∈ {0.1, 0.5, 1.0}, k ∈ {50, 100, 150} and L ∈ {2min, 3min, 6min, 10min}.

5.3 Online experiments

Experimental Settings. We propose to consider an application of our approach in a real-world online Twitter scenario, which involves both the whole social network and the API constraints. For these experiments, we use the full potential of the Twitter APIs, namely 1% of all public tweets for the Sample API, which runs in the background, and 5000 simultaneously followed users (i.e., k = 5000) for the Follow API.

[Figure 2 — four panels of cumulative reward w.r.t. time step, one per topic, comparing Random, CUCB, CUCBV, Contextual (p = 0.1, 0.5, 1.0) and Contextual Naive (p = 0.1, 0.5).]

Figure 2: Cumulative Reward w.r.t time step for different policies on the USElections dataset. From left to right, the considered topics for reward computations are respectively: politics, religion, science and sport.

In order to analyze the behavior of the experimented policies during the first iterations of the capture, we also plot on the right of the figure a zoomed version of the same curves on the first 150 time steps. At the beginning of the process, the Static policy performs better than all the others, which can be explained by two reasons: 1) bandit policies need to select every user at least once in order to initialize their scores, and 2) the users that are part of the Static policy's pool are supposed to be relatively good targets considering the way we chose them. Around iteration 80, both CUCBV and our Contextual algorithm become better than the Static policy, which corresponds to the moment they start to trade off between exploration and exploitation. Then, after a period of around 60 iterations (approximately between the 80th and the 140th time steps) where CUCBV and our Contextual approach behave similarly, the latter allows the capture process to collect much more valuable data for the specified need. This is explained by the fact that our Contextual algorithm requires a given number of iterations to learn the correlation function between contexts and rewards before taking advantage of it. Note the significant changes in the slope of the cumulative reward curve of our Contextual algorithm, which highlight its ability to react to environment changes.
Finally, the number of times each user is selected by our Contextual algorithm is more spread out than with a stationary algorithm such as CUCBV, which confirms the relevance of our dynamic approach. To conclude, in both offline and online settings, our approach proves very efficient at dynamically selecting user streams to follow for focused data capture tasks.

[Figure 3 — cumulative reward w.r.t. time step on the Libya dataset, comparing Random, CUCB, CUCBV, Contextual (p = 0.1, 0.5, 1.0) and Contextual Naive (p = 0.1, 0.5).]

Figure 3: Cumulative Reward w.r.t time step for different policies on the Libya dataset and the politics reward.

At each iteration, the selected users are followed during L = 15 minutes. The messages posted by these users during this interval are evaluated with the same relevance function as in the offline experiments, using the politics topic. We also use an LDA transformation to compute the context vectors. Given Twitter policy, every experiment requires one Twitter Developer account, which limits the number of strategies we can experiment with. We chose to test the four following ones: our contextual approach, CUCBV, Random and another one called Static. The Static policy follows the same 5000 accounts at each round of the process. Those 5000 accounts were chosen by collecting all tweets provided during 24 hours by the Sample streaming API, evaluating them, and finally keeping the 5000 users with the greatest cumulative reward. For information, some famous accounts such as @dailytelegraph, @Independent or @CNBC were part of them.

Results. Figure 4 on the left represents the evolution of the cumulative reward of a two-week-long run for the four tested policies. From these curves, we note the very good behavior of our algorithm in a real-world scenario, as the amount of reward it accumulates grows much more quickly than with the other policies, especially after the first 500 iterations of the process. After these first iterations, our algorithm appears to have acquired a good knowledge of the reward distributions, w.r.t. observed contexts and over the various users of the network.

[Figure 4 — two panels of cumulative reward w.r.t. time step for the live experiment, comparing Random, CUCBV, Contextual and Static: the whole run on the left and a zoom on the first 150 steps on the right.]

Figure 4: Cumulative Reward w.r.t time step for different policies on the live experiment. The figure represents the whole experiment on the left and a zoom on the 150 first steps on the right.

6 Conclusion

In this paper, we tackled the problem of capturing relevant data from social media streams delivered by specific users of the network. The goal is to dynamically select useful user streams to follow at each iteration of the capture, with the aim of maximizing some reward function depending on the utility of the collected data w.r.t. the specified data need. We formalized this task as a specific instance of the contextual bandit problem where, instead of every feature vector being observed at each iteration of the process, each of them has a certain probability of being revealed to the learner. For solving this task, we proposed an adaptation of the popular contextual bandit algorithm LinUCB to the case where some contexts are hidden at each iteration.
Although not any sublinear upper-bounds of the regret can be guaranteed, because of the great uncertainty induced by the hidden contexts, our algorithm behaves well even when the proportion of hidden contexts is high, thanks to some assumptions made on the hidden contexts and a well fitted exploration term. This corresponds to an hybrid stationary / non stationary bandit algorithm. Experiments show the very good performances of our proposal to collect relevant data under real-world capture scenarios. It opens the opportunity for defining new kinds of online intelligent strategies for collecting and tracking focused data from social media streams. Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandits with linear payoff functions. In AISTATS. Colbaugh, R., and Glass, K. 2011. Emerging topic detection for business intelligence via predictive analysis of ’meme’ dynamics. In AAAI Spring Symposium. Fogaras, D.; Rácz, B.; Csalogány, K.; and Sarlós, T. 2005. Towards scaling fully personalized pageRank: algorithms, lower bounds, and experiments. Internet Math. Gisselbrecht, T.; Denoyer, L.; Gallinari, P.; and Lamprier, S. 2015. Whichstreams: A dynamic approach for focused data capture from large social media. In ICWSM. Gupta, P.; Goel, A.; Lin, J.; Sharma, A.; Wang, D.; and Zadeh, R. 2013. Wtf: The who to follow service at twitter. In WWWW. Hannon, J.; Bennett, M.; and Smyth, B. 2010. Recommending twitter users to follow using content and collaborative filtering approaches. In RecSys. Hong, L., and Davison, B. D. 2010. Empirical study of topic modeling in twitter. In ECIR. Kaufmann, E.; Cappe, O.; and Garivier, A. 2012. On bayesian upper confidence bounds for bandit problems. In ICAIS. Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT. Kohli, P.; Salek, M.; and Stoddard, G. 2013. A fast bandit algorithm for recommendation to users with heterogenous tastes. In AAAI. Lage, R.; Denoyer, L.; Gallinari, P.; and Dolog, P. 2013. Choosing which message to publish on social networks: A contextual bandit approach. In ASONAM. Lai, T., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4 – 22. Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. Li, R.; Wang, S.; and Chang, K. C.-C. 2013. Towards social data platform: Automatic topic-focused monitor for twitter stream. Proc. VLDB Endow. Qin, L.; Chen, S.; and Zhu, X. 2014. Contextual combinatorial bandit and its application on diversified online recommendation. In SIAM. Weng, J.; Lim, E.-P.; Jiang, J.; and He, Q. 2010. Twitterrank: Finding topic-sensitive influential twitterers. In WSDM. Acknowledgments This research work has been carried out in the framework of the Technological Research Institute SystemX, and therefore granted with public funds within the scope of the French Program “Investissements d’Avenir”. Part of the work was supported by project Luxid’x financed by DGA on the Rapid program. References Agrawal, S., and Goyal, N. 2012a. Analysis of thompson sampling for the multi-armed bandit problem. In COLT. Agrawal, S., and Goyal, N. 2012b. Thompson sampling for contextual bandits with linear payoffs. CoRR. Audibert, J.-Y., and Bubeck, S. 2009. Minimax policies for adversarial and stochastic bandits. In COLT. Audibert, J.-Y.; Munos, R.; and Szepesvari, C. 2007. Tuning bandit algorithms in stochastic environments. In ALT. 
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. J. Mach. Learn. Res. Buccapatnam, S.; Eryilmaz, A.; and Shroff, N. B. 2014. Stochastic bandits with side observations on networks. In SIGMETRICS. Chapelle, O., and Li, L. 2011. An empirical evaluation of thompson sampling. In NIPS. Curran Associates, Inc. Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial multiarmed bandit: General framework and applications. In ICML. 140