User Modeling in Search Engine Logs

Hongning Wang, Advisor: ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
{wang296,czhai}@Illinois.edu

A Non-parametric Bayesian Approach [WSDM’14]

In this work, we study the problem of user modeling in search log data and propose a generative model, dpRank, within a non-parametric Bayesian framework. By postulating generative assumptions about a user's search behaviors, dpRank jointly identifies each individual user's latent search interests and his/her distinct result preferences. Experimental results on a large-scale news search log data set validate the effectiveness of the proposed approach, which not only provides an in-depth understanding of a user's search intents but also benefits a variety of personalized applications.

A Ranking Model Adaptation Approach [SIGIR’13]

In this work, we propose a general ranking model adaptation framework for personalized search. The proposed framework quickly learns to apply a series of linear transformations, e.g., scaling and shifting, over the parameters of the given global ranking model such that the adapted model can better fit each individual user's search preferences. Extensive experimentation based on a large set of search logs from a major commercial Web search engine confirms the effectiveness of the proposed method compared to several state-of-the-art ranking model adaptation methods.
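The scaling-and-shifting adaptation described above can be illustrated with a minimal Python sketch. It assumes that each feature v belongs to a group g(v) and that all features in a group share one scale a and one shift b, so the adapted weight is a_{g(v)} * w_v + b_{g(v)}. The function names, the grouping, and the toy numbers are hypothetical and are not taken from the poster; learning the scales and shifts from a user's clicks is not shown.

```python
def adapt_weights(w_global, group_of, scale, shift):
    """Apply a per-group linear transformation to global ranking weights.

    w_global : list of floats, the global model's weights (one per feature)
    group_of : list of ints, group index g(v) of each feature v (assumed given)
    scale    : list of floats, one scaling factor per group
    shift    : list of floats, one shifting term per group
    """
    return [scale[g] * w + shift[g] for w, g in zip(w_global, group_of)]


def score(weights, x):
    """Linear ranking score: inner product of weights and feature vector."""
    return sum(w * xi for w, xi in zip(weights, x))


# Toy example (hypothetical numbers): three features in two groups.
w_s = [1.0, 2.0, 3.0]   # global weights
g = [0, 0, 1]           # features 0 and 1 share group 0; feature 2 is in group 1
a_u = [2.0, 1.0]        # per-group scaling for one user
b_u = [0.5, -1.0]       # per-group shifting for the same user

w_u = adapt_weights(w_s, g, a_u, b_u)
user_score = score(w_u, [1.0, 0.0, 2.0])
```

With every scale set to 1 and every shift set to 0, the adapted weights equal the global weights, so in this sketch the global model is the natural starting point for a user with little search history.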
Example search log:

| Timestamp | Query |
| --- | --- |
| 5/29/2012 14:06:04 | coney island Cincinnati |
| 5/30/2012 12:12:04 | drive direction to coney island |
| 5/31/2012 19:40:38 | motel 6 locations |
| 5/31/2012 19:45:04 | Cincinnati hotels near coney island |

Methods (dpRank)

Dirichlet Process Prior
• Aggregated level: information shared by all the users, i.e., latent user groups with parameters $(\mu_c, \sigma_c^2, \beta_c)$
• Individual level: characterize user's own interest, $\pi^u$
• Modeling of search interest: $p(q_i) \sim N(\mu_k, \sigma_k^2 I)$
• Modeling of result preferences (pairwise ranking model): $p(D_i \mid q_i) = \prod_{y_{is} > y_{it}} \frac{1}{1 + \exp(-\beta_k^{\top}(d_{is} - d_{it}))}$
• Latent user groups: $\{\pi_k\}_{k=1}^{\infty} \sim DP(\gamma, \eta)$

[Figure: latent user groups 1, ..., k, ..., c with parameters $(\mu_1, \sigma_1^2, \beta_1)$, $(\mu_k, \sigma_k^2, \beta_k)$, $(\mu_c, \sigma_c^2, \beta_c)$ and group proportions $\pi_1, \pi_2, \pi_3, \ldots$; per-user proportions $\pi^u_1, \pi^u_2, \pi^u_3, \ldots$ over the groups; observed clicks]

A fully generative model for exploring users’ search behaviors
1. Draw latent user groups from DP: $\mu_{kt} \sim N(\mu_0, \sigma_0^2)$, $\sigma_{kt}^2 \sim Gamma(\alpha_0, \beta_0)$, $\beta_{kv} \sim N(0, a_0^2)$
2. Draw group membership for each user from DP: $\{\pi_k\}_{k=1}^{\infty} \sim DP(\gamma, \eta)$
3. To generate a query in user u:
3.1 Draw a latent user group c: $c_i \sim \pi_u$
3.2 Draw query $q_i$ for user u accordingly (with $k = c_i$): $p(q_i) \sim N(\mu_k, \sigma_k^2 I)$
3.3 Draw click preferences for $q_i$ accordingly: $p(D_i \mid q_i) = \prod_{y_{is} > y_{it}} \frac{1}{1 + \exp(-\beta_k^{\top}(d_{is} - d_{it}))}$

Gibbs sampling for posterior inference

• Document ranking: $s(d_j^t, q_j) = \frac{1}{|S|} \sum_{s \in S} \sum_{k} p(c^s = k \mid q_j)\, \beta_k^{s\top} d_j^t$

Experimental Results (dpRank)

• Yahoo! News search logs
• May to July, 2011
• 65 ranking features for each Query-Document pair

• Query distribution in latent user groups (top ranked queries, one of the 10 groups per line):
  - today in history, nascar 2011 schedule, today history, this day in history
  - miami heat, los angeles lakers, liverpool football club, arsenal football, nfl lockout
  - los angeles lakers, arsenal football, the dark knight rises, transformers 3, manchester united
  - the titanic, the bachelorette, cars 2, hangover 2, the voice
  - tree of life, game of thrones, sonic the hedgehog, world of warcraft, mtv awards 2011
  - casey anthony trial, casey anthony jurors, casey anthony, crude oil prices, air france flight 447
  - fake tupac story, pbs hackers, alaska earthquake, southwest pilot, arizona wildfires
  - selena gomez, lady gaga, britney spears, jennifer aniston, taylor swift
  - iran, china, libya, vietnam, Syria
  - joplin missing, apple icloud, sony hackers, google subpoena, ford transmission

• Click preferences in latent user groups
[Figure: click preferences per latent user group; axes: Feature ID, Group ID; labeled features: site authority, proximity in title, query match in title, document age]

• Document ranking

| Method | MAP | P@1 | P@3 | MRR |
| --- | --- | --- | --- | --- |
| URSVM | 0.487 | 0.298 | 0.220 | 0.501 |
| GRSVM | 0.616 | 0.446 | 0.283 | 0.632 |
| TRSVM | 0.622 | 0.459 | 0.283 | 0.638 |
| IRSVM | 0.617 | 0.449 | 0.281 | 0.632 |
| dpRank | 0.642 | 0.485 | 0.290 | 0.658 |

Methods (ranking model adaptation)

• Adjust the generic ranking model’s parameters with respect to each individual user’s ranking preferences

Global model: $f(x) = w^{\top} x$
Adapted model for user u: $f^u(x) = (A^u \tilde{w}^s)^{\top} x$

$A^u = \begin{pmatrix} a^u_{g(1)} & 0 & \cdots & 0 & b^u_{g(1)} \\ 0 & a^u_{g(2)} & \cdots & 0 & b^u_{g(2)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & a^u_{g(V)} & b^u_{g(V)} \end{pmatrix}$

Number of adaptation parameters per user: $O(V^2)$ for a full transformation matrix, $O(V)$ for the scaling-and-shifting form above.

• Linear regression based model adaptation

$\min_{A^u} L_{adapt}(A^u) = L(Q^u; f^u) + \lambda R(A^u)$, where $f^u(x) = (A^u \tilde{w}^s)^{\top} x$ and $\tilde{w}^s = (w^s, 1)$

  - $L(Q^u; f^u)$: loss function from any linear learning-to-rank algorithm, e.g., RankNet, LambdaRank, RankSVM
  - $R(A^u)$: complexity of adaptation
  - The induced optimization problem has the same complexity as the original problem

• Instantiation of RankSVM

$\min_{w, \xi_{ijl}} \frac{1}{2}\|w\|^2 + C \sum_{q_i} \sum_{j,l} \xi_{ijl}$
s.t. $w^{\top} \Delta x_{ijl} \ge 1 - \xi_{ijl}$, $\forall q_i, x_{ij}, x_{il}$; $\xi_{ijl} \ge 0$
where $y_{ij} > y_{il}$ and $\Delta x_{ijl} = x_{ij} - x_{il}$

$\max_{\alpha} \sum_t (1 - f^s(x_t))\, \alpha_t - \frac{1}{2} \alpha^{\top} \big(K_1(x, x) + K_2(x, x)\big) \alpha$
s.t. $0 \le \alpha_t \le C$, $\forall t$
where $K_1(x_t, x_r) = \sum_{k=1}^{K} \Big(\sum_{g(v)=k} w_v^s x_{tv}\Big)\Big(\sum_{g(v)=k} w_v^s x_{rv}\Big)$ and $K_2(x_t, x_r) = \frac{1}{\sigma} \sum_{k=1}^{K} \Big(\sum_{g(v)=k} x_{tv}\Big)\Big(\sum_{g(v)=k} x_{rv}\Big)$

  - Margin rescaling: the term $1 - f^s(x_t)$
  - Non-linear kernels: $K_1$ and $K_2$

Experimental Results (ranking model adaptation)

• Bing query log: May 27, 2012 to May 31, 2012
• 1830 ranking features

| | # Users | # Queries | # Documents |
| --- | --- | --- | --- |
| Annotation Set | - | 49,782 | 2,320,711 |
| User Set | 34,827 | 187,484 | 1,744,969 |

| User Class | Definition | % Population |
| --- | --- | --- |
| Heavy | [10, ∞) queries | 6.8 |
| Medium | [5, 10) queries | 14.9 |
| Light | (0, 5) queries | 78.3 |

• Query-level improvement against global model
[Figure: query-level improvement against global model]

• User-level improvement against global model
[Figure: user-level improvement against global model]

• Adaptation efficiency
[Figure: adaptation efficiency; labels: global model, per-user basis adaptation baseline]

Results by user class:

| User Class | Method | ΔMAP | ΔP@1 | ΔP@3 | ΔMRR |
| --- | --- | --- | --- | --- | --- |
| Heavy | RA | 0.1843 | 0.3309 | 0.0120 | 0.1832 |
| Heavy | Cross | 0.1998 | 0.3523 | 0.0182 | 0.1994 |
| Medium | RA | 0.1102 | 0.2129 | 0.0025 | 0.1103 |
| Medium | Cross | 0.1494 | 0.2561 | 0.0208 | 0.1500 |
| Light | RA | 0.0042 | 0.0575 | -0.0221 | 0.0041 |
| Light | Cross | 0.0403* | 0.0894* | -0.0021 | 0.0406* |

* indicates p-value < 0.01
Cross: use cross-training to determine feature grouping
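The pairwise ranking model that dpRank uses for click preferences, where a document with features d_s is preferred over one with features d_t with probability 1 / (1 + exp(-β_k^T (d_s - d_t))), can be sketched in plain Python. This is only an illustration for a single latent group with a fixed preference vector: the function names and toy vectors are hypothetical, and the Gibbs sampling that infers the groups and their parameters is not shown.

```python
import math


def preference_prob(beta, d_s, d_t):
    """Probability that document s is preferred over document t.

    beta : list of floats, a group's result-preference weights (assumed known)
    d_s, d_t : lists of floats, ranking features of the two documents
    """
    margin = sum(b * (s - t) for b, s, t in zip(beta, d_s, d_t))
    return 1.0 / (1.0 + math.exp(-margin))


def click_log_likelihood(beta, preferred_pairs):
    """Log-likelihood of observed preferences under one group's beta.

    preferred_pairs : list of (d_s, d_t) tuples where d_s was preferred
    (for example, clicked) over d_t for the same query.
    """
    return sum(math.log(preference_prob(beta, d_s, d_t))
               for d_s, d_t in preferred_pairs)


# Toy example (hypothetical numbers): two ranking features.
beta_k = [1.0, -0.5]
clicked = [2.0, 0.0]
skipped = [1.0, 2.0]

p = preference_prob(beta_k, clicked, skipped)            # margin is 2.0
ll = click_log_likelihood(beta_k, [(clicked, skipped)])
```

Because the model depends only on the feature difference, swapping the two documents gives the complementary probability, and two documents with identical features are preferred with probability 0.5.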