Poster - University of Illinois at Urbana

User Modeling in Search Engine Logs
Hongning Wang, Advisor: ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{wang296,czhai}@Illinois.edu
A Non-parametric Bayesian Approach [WSDM’14]
A Ranking Model Adaptation Approach [SIGIR’13]
In this work, we study the problem of user modeling in search log data and propose a generative model,
dpRank, within a non-parametric Bayesian framework. By postulating generative assumptions about a user's
search behaviors, dpRank jointly identifies each individual user's latent search interests and distinct result
preferences. Experimental results on a large-scale news search log data set validate the
effectiveness of the proposed approach, which not only provides an in-depth understanding of a user's search
intents but also benefits a variety of personalized applications.
Methods
In this work, we propose a general ranking model adaptation framework for personalized search. The
proposed framework quickly learns to apply a series of linear transformations, e.g., scaling and shifting,
over the parameters of the given global ranking model such that the adapted model can better fit each
individual user's search preferences. Extensive experimentation based on a large set of search logs from
a major commercial Web search engine confirms the effectiveness of the proposed method compared to
several state-of-the-art ranking model adaptation methods.
Methods
• Adjust the generic ranking model’s parameters with respect to each individual user’s
ranking preferences
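As a rough illustration of the scaling-and-shifting idea, the sketch below adapts a global weight vector per feature group, so that feature v receives the weight a_{g(v)} * w_v + b_{g(v)}. The function names, the `group_of` mapping, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the poster.

```python
from typing import List, Sequence


def adapt_weights(global_w: Sequence[float],
                  scale: Sequence[float],
                  shift: Sequence[float],
                  group_of: Sequence[int]) -> List[float]:
    """Per-user weights: scale and shift each global weight by its group's (a, b)."""
    return [scale[group_of[v]] * w + shift[group_of[v]]
            for v, w in enumerate(global_w)]


def score(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear ranking score w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical example: 3 features in 2 groups (features 0 and 1 share group 0).
global_w = [0.5, -1.0, 2.0]
group_of = [0, 0, 1]
scale = [1.2, 0.5]   # a_k for each group (assumed values)
shift = [0.0, 0.1]   # b_k for each group (assumed values)

user_w = adapt_weights(global_w, scale, shift, group_of)
doc = [1.0, 0.0, 1.0]
print("global score:", score(global_w, doc))
print("adapted score:", score(user_w, doc))
```

With scale 1 and shift 0 for every group, the adapted model reduces to the global model, which is why a user with little data can stay close to the shared ranker.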
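[Figure: dpRank model structure. A Dirichlet Process Prior generates the latent user groups (Group 1, ..., Group k, ..., Group c), each with parameters $(\mu_k, \sigma_k^2, \beta_k)$. Aggregated level: information shared by all the users. Individual level: group proportions $\pi^u_1, \pi^u_2, \pi^u_3, \dots$ characterize the user's own interest. Panel labels: p(Q), f1, f2, Clicks.]
Latent User Groups: $\{\pi_k\}_{k=1}^{\infty} \sim DP(\gamma, \eta)$
Modeling of search interest: $p(q_i) \sim N(\mu_k, \sigma_k^2 I)$
Modeling of result preferences: $p(D_i \mid q_i) = \prod_{y_{is} > y_{it}} \frac{1}{1 + \exp(-\beta_k^{\top}(d_{is} - d_{it}))}$
[Figure: ranking model adaptation. Global model $f(x) = w^T x$ and adapted model $f^u(x) = (A^u \tilde{w}^s)^T x$, plotted on axes x and y; complexity labels $O(V^2)$ and $O(V)$.]
$A^u = \begin{pmatrix} a^u_{g(1)} & 0 & \cdots & 0 & b^u_{g(1)} \\ 0 & a^u_{g(2)} & \cdots & 0 & b^u_{g(2)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & a^u_{g(V)} & b^u_{g(V)} \end{pmatrix}$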
Timestamp            Query
5/29/2012 14:06:04   coney island Cincinnati
5/30/2012 12:12:04   drive direction to coney island
5/31/2012 19:40:38   motel 6 locations
5/31/2012 19:45:04   Cincinnati hotels near coney island
• Linear regression based model adaptation
$\min_{A^u} L_{adapt}(A^u) = L(Q^u; f^u) + \lambda R(A^u)$
where $f^u(x) = (A^u \tilde{w}^s)^T x$ and $\tilde{w}^s = (w^s, 1)$
$L(Q^u; f^u)$: loss function from any linear learning-to-rank algorithm, e.g., RankNet, LambdaRank, RankSVM
$R(A^u)$: complexity of adaptation
Induced optimization problem in the same complexity as the original problem
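To make the adaptation objective concrete, here is a minimal sketch that evaluates a loss-plus-regularizer objective for one user. It assumes a RankSVM-style pairwise hinge loss for L and, as a stand-in for R, a squared penalty on how far the scaling moves from 1 and the shifting from 0; that regularizer, the function names, and the toy data are assumptions for illustration rather than the poster's exact formulation.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (preferred doc, other doc)


def adapted_score(global_w: Sequence[float], scale: Sequence[float],
                  shift: Sequence[float], group_of: Sequence[int],
                  x: Sequence[float]) -> float:
    """Score x with per-group scaled and shifted global weights."""
    return sum((scale[group_of[v]] * w + shift[group_of[v]]) * xv
               for v, (w, xv) in enumerate(zip(global_w, x)))


def adaptation_objective(global_w: Sequence[float], scale: Sequence[float],
                         shift: Sequence[float], group_of: Sequence[int],
                         pairs: List[Pair], lam: float) -> float:
    """Pairwise hinge loss on the user's preference pairs plus lam * R."""
    loss = 0.0
    for x_pos, x_neg in pairs:
        margin = (adapted_score(global_w, scale, shift, group_of, x_pos)
                  - adapted_score(global_w, scale, shift, group_of, x_neg))
        loss += max(0.0, 1.0 - margin)
    # Assumed regularizer: stay close to the identity transformation.
    reg = sum((a - 1.0) ** 2 for a in scale) + sum(b ** 2 for b in shift)
    return loss + lam * reg


# Toy example: the global model relies on feature 0, but this user's click
# preference favors a document that is strong on feature 1.
global_w = [1.0, 0.0]
group_of = [0, 1]
pairs = [([0.0, 1.0], [1.0, 0.0])]

unadapted = adaptation_objective(global_w, [1.0, 1.0], [0.0, 0.0], group_of, pairs, lam=0.1)
adapted = adaptation_objective(global_w, [1.0, 1.0], [0.0, 2.0], group_of, pairs, lam=0.1)
print(unadapted, adapted)
```

Any optimizer could then search over the per-group scale and shift values; the point of the sketch is only that the number of free parameters grows with the number of feature groups, not with the square of the feature dimension.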
• Instantiation of RankSVM
A fully generative model for exploring users’ search behaviors
1. Draw latent user groups from DP: $\mu_{kt} \sim N(\mu_0, \sigma_0^2)$, $\sigma_{kt}^2 \sim Gamma(\alpha_0, \beta_0)$, $\beta_{kv} \sim N(0, a_0^2)$
2. Draw group membership for each user from DP: $\{\pi_k\}_{k=1}^{\infty} \sim DP(\gamma, \eta)$
3. To generate a query in user u:
3.1 Draw a latent user group c: $c_i \sim \pi_u$
3.2 Draw query $q_i$ for user u accordingly: $p(q_i) \sim N(\mu_k, \sigma_k^2 I)$
3.3 Draw click preferences for $q_i$ accordingly: $p(D_i \mid q_i) = \prod_{y_{is} > y_{it}} \frac{1}{1 + \exp(-\beta_k^{\top}(d_{is} - d_{it}))}$
Gibbs sampling for posterior inference
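The generative steps above can be sketched as a small forward sampler. This is only an illustration under several assumptions not stated on the poster: the Dirichlet process is approximated by truncated stick-breaking, the Gamma prior uses a shape/scale parameterization, variances are drawn per dimension, and all hyperparameter values and function names are hypothetical. Posterior inference (the Gibbs sampler) is not shown.

```python
import math
import random
from typing import List, Sequence, Tuple

Group = Tuple[List[float], List[float], List[float]]  # (mu, variance, beta)


def stick_breaking(gamma: float, truncation: int, rng: random.Random) -> List[float]:
    """Truncated stick-breaking weights, a finite approximation of a DP draw."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, gamma)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # leftover mass goes to the last group
    return weights


def draw_group(dim_q: int, dim_d: int, rng: random.Random,
               mu0: float = 0.0, sigma0: float = 1.0,
               alpha0: float = 2.0, beta0: float = 1.0, a0: float = 1.0) -> Group:
    """Step 1: draw one latent group's parameters from the priors."""
    mu = [rng.gauss(mu0, sigma0) for _ in range(dim_q)]
    var = [rng.gammavariate(alpha0, beta0) for _ in range(dim_q)]  # shape, scale
    beta = [rng.gauss(0.0, a0) for _ in range(dim_d)]
    return mu, var, beta


def generate_query(pi_u: Sequence[float], groups: Sequence[Group],
                   rng: random.Random) -> Tuple[int, List[float]]:
    """Steps 3.1 and 3.2: pick a group for the query, then draw the query vector."""
    k = rng.choices(range(len(pi_u)), weights=pi_u)[0]
    mu, var, _ = groups[k]
    return k, [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]


def preference_prob(beta: Sequence[float], d_s: Sequence[float],
                    d_t: Sequence[float]) -> float:
    """Step 3.3: logistic probability that document s is preferred over t."""
    z = sum(b * (s - t) for b, s, t in zip(beta, d_s, d_t))
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
groups = [draw_group(dim_q=4, dim_d=3, rng=rng) for _ in range(5)]
pi_u = stick_breaking(gamma=1.0, truncation=5, rng=rng)  # step 2 for one user
k, q = generate_query(pi_u, groups, rng)
p = preference_prob(groups[k][2], [1.0, 0.0, 0.5], [0.0, 1.0, 0.5])
print(k, q, p)
```

Because the click-preference term is pairwise, swapping the two documents gives the complementary probability, which is a quick sanity check on any implementation.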
Experimental Results (dpRank)
• Yahoo! News search logs
• May to July, 2011
• 65 ranking features for each Query-Document pair
• Query distribution in latent user groups
Top Ranked Queries (one latent user group per line):
  today in history, nascar 2011 schedule, today history, this day in history
  miami heat, los angeles lakers, liverpool football club, arsenal football, nfl lockout
  los angeles lakers, arsenal football, the dark knight rises, transformers 3, manchester united
  the titanic, the bachelorette, cars 2, hangover 2, the voice
  tree of life, game of thrones, sonic the hedgehog, world of warcraft, mtv awards 2011
  joplin missing, apple icloud, sony hackers, google subpoena, ford transmission
  casey anthony trial, casey anthony jurors, casey anthony, crude oil prices, air france flight 447
  fake tupac story, pbs hackers, alaska earthquake, southwest pilot, arizona wildfires
  selena gomez, lady gaga, britney spears, jennifer aniston, taylor swift
  iran, china, libya, vietnam, Syria
• Click preferences in latent user groups
[Figure: click preferences in latent user groups. Axes: Group ID, Feature ID. Labeled features: site authority, proximity in title, query match in title, document age.]
• Document ranking
$s(d_{jt}, q_j) = \frac{1}{|S|} \sum_{s \in S} \sum_{k} p(c^s = k \mid q_j)\, \beta_k^{s\top} d_{jt}$
Method   MAP     P@1     P@3     MRR
URSVM    0.487   0.298   0.220   0.501
GRSVM    0.616   0.446   0.283   0.632
TRSVM    0.622   0.459   0.283   0.638
IRSVM    0.617   0.449   0.281   0.632
dpRank   0.642   0.485   0.290   0.658

Pairwise ranking model (RankSVM)
$\min_{w, \xi_{ijl}} \frac{1}{2}\|w\|^2 + C \sum_{q_i} \sum_{j,l} \xi_{ijl}$
$s.t.\ w^T \Delta x_{ijl} \ge 1 - \xi_{ijl},\ \forall q_i, x_{ij}, x_{il};\quad \xi_{ijl} \ge 0$
where $y_{ij} > y_{il}$ and $\Delta x_{ijl} = x_{ij} - x_{il}$
Margin rescaling and non-linear kernels
$\max_{\alpha} \sum_t (1 - f^s(x_t))\,\alpha_t - \frac{1}{2} \alpha^T \left(K_1(x, x) + K_2(x, x)\right) \alpha$
$s.t.\ 0 \le \alpha_t \le C,\ \forall t$
where $K_1(x_t, x_r) = \sum_k \left(\sum_{g(v)=k} w^s_v x_{tv}\right)\left(\sum_{g(v)=k} w^s_v x_{rv}\right)$
and $K_2(x_t, x_r) = \frac{1}{\sigma} \sum_k \left(\sum_{g(v)=k} x_{tv}\right)\left(\sum_{g(v)=k} x_{rv}\right)$

Experimental Results (ranking model adaptation)
• Bing query log: May 27, 2012 – May 31, 2012
• 1830 ranking features
                 # Users   # Queries   # Documents
Annotation Set   -         49,782      2,320,711
User Set         34,827    187,484     1,744,969

User Class   Queries per user   % Population
Heavy        [10, ∞)            6.8
Medium       [5, 10)            14.9
Light        (0, 5)             78.3
• Query-level improvement against global model
[Figure: query-level improvement against the global model]
• User-level improvement against global model
[Figure: user-level improvement against the global model]
• Adaptation efficiency
[Figure: adaptation efficiency, compared with a per-user basis adaptation baseline]
User Class   Method   ΔMAP      ΔP@1      ΔP@3      ΔMRR
Heavy        RA       0.1843    0.3309    0.0120    0.1832
Heavy        Cross    0.1998    0.3523    0.0182    0.1994
Medium       RA       0.1102    0.2129    0.0025    0.1103
Medium       Cross    0.1494    0.2561    0.0208    0.1500
Light        RA       0.0042    0.0575    -0.0221   0.0041
Light        Cross    0.0403*   0.0894*   -0.0021   0.0406*
* Indicates p-value<0.01
Use cross-training to determine feature grouping
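The results above are reported in terms of MAP, P@1, P@3, and MRR. For readers less familiar with these metrics, the sketch below computes them for a single ranked list of binary relevance labels; MAP and MRR are then the means of average precision and reciprocal rank over queries. How clicks are converted into relevance labels is not stated on the poster, so the binary labels here are an assumption.

```python
from typing import List, Sequence


def precision_at_k(rels: Sequence[int], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k


def reciprocal_rank(rels: Sequence[int]) -> float:
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(rels: Sequence[int]) -> float:
    """Mean of precision values taken at each relevant result's rank."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical ranked lists for two queries (1 = relevant/clicked, 0 = not).
rankings = [[0, 1, 0, 1], [1, 0, 0, 0]]
map_score = mean([average_precision(r) for r in rankings])
mrr_score = mean([reciprocal_rank(r) for r in rankings])
p_at_1 = mean([precision_at_k(r, 1) for r in rankings])
print(map_score, mrr_score, p_at_1)
```

A Δ value in the table would then correspond to the difference between such a metric for an adapted model and for the model it is compared against.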