
Characterizing Web Content, User
Interests, and Search Behavior by
Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson,
Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research
Search and recommendation
are about matching.
Queries
Documents
Websites
Users
Term-space matching is not
always a good idea.
Granularity
Sparsity
Efficiency
Can we build representations
beyond term vectors?
Topic Category
Reading Level
Sentiment
Style
What would be their implications for
search and recommendations?
Queries
Documents
Websites
Users
Topic Category
Reading Level
Sentiment
Style
In a Nutshell,

WHAT WE DID:
Build Profiles of Reading Level and Topic (RLT)
For queries, websites, users and search sessions
In order to characterize and compare entities

WHAT WE FOUND:
Profile matching predicts user’s content preference
Profiles can indicate when not to personalize
Profile features can predict expert content
Building Reading Level and
Topic Profiles
Predicting Reading Level and Topic for URL

Reading Level Classifier
Based on language model and other sources

Topic Classifier
Trained using URLs in each Open Directory Project category

Profile
Distribution over reading level, topic,
or reading level and topic (RLT)
Example for a URL d1 : P(R|d1), P(T|d1)
Entity Profile Built from Related URLs

Entities and Related URLs
Websites : content vs. user-viewed URLs
Users : URLs visited during search sessions
Queries : top-10 retrieved URLs

Example:
Site profile made from URLs visited during search sessions
[Figure: per-URL predictions P(R|d) and P(T|d) for the visited URLs are combined into the site profile P(R,T|s)]
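
The aggregation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every URL contributes equally and that reading level and topic are independent given a URL, so that P(R,T|d) = P(R|d) P(T|d). Neither assumption is stated on the slide, and the function name `build_profile` and the toy distributions are hypothetical.

```python
from collections import defaultdict

def build_profile(url_predictions):
    """Average per-URL joint distributions into one entity profile P(R,T|s).

    url_predictions: list of (p_r, p_t) pairs for each related URL, where
    p_r maps reading level -> probability and p_t maps topic -> probability.
    """
    if not url_predictions:
        raise ValueError("need at least one related URL")
    profile = defaultdict(float)
    weight = 1.0 / len(url_predictions)  # assumption: uniform weight per URL
    for p_r, p_t in url_predictions:
        for level, pr in p_r.items():
            for topic, pt in p_t.items():
                # assumption: P(R,T|d) = P(R|d) * P(T|d)
                profile[(level, topic)] += weight * pr * pt
    return dict(profile)

# Toy classifier outputs for two URLs visited on one site.
urls = [
    ({5: 0.5, 11: 0.5}, {"Science": 1.0}),
    ({5: 1.0}, {"Science": 0.5, "Health": 0.5}),
]
site_profile = build_profile(urls)
```

Because each per-URL joint distribution sums to one and the weights are uniform, the resulting profile is itself a probability distribution over (reading level, topic) pairs.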
Entity Profile Built with Related Entities

Entity and related entities
User – Websites visited
Website – Surfacing queries
Query – Issuing users

Example:
Site profile made from the profiles of its visitors
[Figure: users visit websites, websites surface for queries, and users issue queries; visitor profiles P(R,T|u) are combined into the site profile P(R,T|s)]
Characterizing and Comparing Profiles

Characterizing an Individual Entity
Mean : expectation
Variance : entropy

Characterizing a Group of Entities
Build a group centroid from its members
Variance : divergence among members

Comparing Entities and Groups
Difference in mean
Divergence in profile (distribution)
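
The statistics listed above might be computed as in the sketch below. It is an illustration under assumptions the slide does not spell out: entropy and divergence are measured in bits, the group centroid is an unweighted average, and "divergence among members" is taken to be the mean KL divergence from each member to the centroid. All function names are hypothetical.

```python
import math

def expected_level(p_r):
    """Mean of a reading-level distribution {level: prob}."""
    return sum(level * p for level, p in p_r.items())

def entropy(p):
    """Shannon entropy in bits; higher means a more diverse profile."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def centroid(profiles):
    """Group centroid: unweighted average of the member distributions."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}

def kl(p, q):
    """KL(p || q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(x * math.log2(x / q[k]) for k, x in p.items() if x > 0)

def group_divergence(profiles):
    """Assumed group 'variance': mean KL from each member to the centroid."""
    c = centroid(profiles)
    return sum(kl(p, c) for p in profiles) / len(profiles)

# Two toy reading-level profiles.
u1 = {5: 0.5, 11: 0.5}   # diverse reader
u2 = {5: 1.0}            # focused reader

mean_gap = expected_level(u1) - expected_level(u2)  # difference in mean
spread = group_divergence([u1, u2])                 # divergence among members
```

Measuring each member against the centroid avoids division by zero, since the centroid has non-zero mass wherever any member does.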
Characterizing Web Content, User
Interests, and Search Behavior
Data Set

Session Log Data
2,281,150 URL visits (1,218,433 SERP clicks)
Collected from 8,841 users

Profiles of Entities
4,715 websites with 25+ clicked URLs
7,613 users with 25+ URL visits
141,325 unique queries
Reading Level Distribution for
Top ODP Categories

Each topic has a different reading level distribution
Category         R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
Reference        0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
Health           0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
Science          0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
Computers        0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
Business         0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
Society          0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
Adult            0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
Kids and Teens   0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
Games            0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
Recreation       0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
Arts             0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
Home             0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
News             0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
Shopping         0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
Sports           0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94
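
As a worked check of the E[R|T] column, the expectation can be recomputed from a row of displayed probabilities. The sketch below uses the Reference row and assumes reading levels R1 to R12 map to the integers 1 to 12; the variable names are illustrative only.

```python
# P(R|T=Reference) for R1..R12, copied from the table.
reference_row = [0.00, 0.00, 0.00, 0.02, 0.17, 0.10,
                 0.15, 0.04, 0.02, 0.03, 0.20, 0.27]

# E[R|T] = sum over levels of level * P(level | topic).
expected_reading_level = sum(
    level * p for level, p in enumerate(reference_row, start=1)
)
```

This yields about 8.82, close to the 8.80 shown for Reference; the small gap presumably comes from the probabilities being rounded to two decimals on the slide.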
Topic and reading level characterize
websites in each category
Profile matching predicts user’s preference
over search results

Metric
% of user’s preferences predicted by profile matching,
for each clicked website over the skipped website above

Results
By degree of focus in user profile : H(R,T|u)
By the distance metric between user and website :
KLR(u,s) / KLT(u,s) / KLRLT(u,s)

User Group   #Clicks   KLR(u,s)   KLT(u,s)   KLRLT(u,s)
↑Focused     5,960     59.23%     60.79%     65.27%
             147,195   52.25%     54.20%     54.41%
↓Diverse     197,733   52.75%     53.36%     53.63%
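
The metric above can be illustrated with a small sketch. It assumes that a preference counts as predicted when the clicked website's profile is closer to the user's profile than the skipped website's profile, that distance is KL(user || site) in bits, and that site probabilities are floored at a small constant to avoid division by zero. The direction of the divergence, the floor, and the function names are assumptions, not details given on the slide.

```python
import math

def kl(p, q, floor=1e-6):
    """KL(p || q) in bits, flooring q to avoid division by zero (assumption)."""
    return sum(
        pv * math.log2(pv / max(q.get(k, 0.0), floor))
        for k, pv in p.items() if pv > 0
    )

def preference_accuracy(user, pairs):
    """Fraction of (clicked, skipped) site pairs where the clicked site
    is closer to the user's profile than the skipped one."""
    correct = sum(
        1 for clicked, skipped in pairs
        if kl(user, clicked) < kl(user, skipped)
    )
    return correct / len(pairs)

# Toy topic profiles.
user = {"Science": 0.8, "Sports": 0.2}
site_a = {"Science": 0.7, "Sports": 0.3}
site_b = {"Science": 0.1, "Sports": 0.9}

# The user clicked site_a over site_b once, and site_b over site_a once.
accuracy = preference_accuracy(user, [(site_a, site_b), (site_b, site_a)])
```

Here site_a is much closer to the user than site_b, so only the first pair is predicted correctly and the accuracy is 0.5, the same as a coin flip.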
Users’ Deviation from Their Own Profiles

Stretch reading
Session-level reading level >> Long-term reading level

Casual reading
Session-level reading level << Long-term reading level
URL Title Words for
Stretch Reading

Title word             Log ratio
tests                  2.22
test                   1.99
sample                 1.94
digital                1.88
(tuition) options      1.87
(financial) aid        1.87
(medication) effects   1.84
education              1.77
URL Title Words for
Casual Reading

Title word      Log ratio
best            -0.42
football        -0.45
store           -0.46
great (deals)   -0.47
items           -0.52
new             -0.53
sale            -0.61
games           -0.65
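
The slides do not define the log ratio, so the sketch below is one plausible reading: the natural log of a word's relative frequency in titles from one kind of session divided by its relative frequency in titles from the remaining sessions, with add-one smoothing. The smoothing, the log base, the function name, and the toy titles are all assumptions.

```python
import math
from collections import Counter

def log_ratios(target_titles, background_titles, smoothing=1.0):
    """Smoothed log ratio of word frequencies, target vs. background titles."""
    target = Counter(w for t in target_titles for w in t.lower().split())
    background = Counter(w for t in background_titles for w in t.lower().split())
    vocab = set(target) | set(background)
    t_total = sum(target.values()) + smoothing * len(vocab)
    b_total = sum(background.values()) + smoothing * len(vocab)
    return {
        w: math.log(((target[w] + smoothing) / t_total)
                    / ((background[w] + smoothing) / b_total))
        for w in vocab
    }

# Toy titles, not taken from the study's logs.
stretch_titles = ["practice tests", "sample tests"]
other_titles = ["games store", "sale items"]
ratios = log_ratios(stretch_titles, other_titles)
```

Words over-represented in the target titles get a positive score and words over-represented in the background get a negative one, matching the sign pattern of the two tables.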
Comparing Expert vs. Non-expert URLs

Expert vs. Non-expert URLs taken from [White’09]
Predicting Expert vs. Novice Websites


Results
Baseline (predict most likely class) : 65.8%
Classifier accuracy : 82.2%

Features
Feature       Correl. with Expertness   Description
E[R|Qs]       +0.34                     Expectation of Surfacing Query's RL
E[R|Us]       +0.44                     Expectation of Visitor's RL
DivRLT(U,s)   -0.56                     Distance of visitors’ RLT profile from site's
DivT(U,s)     -0.55                     Distance of visitors’ Topic profile from site's
Thank you for your attention!
WHAT WE DID:
Build Profiles of Reading Level and Topic (RLT)
For Queries, Websites, Users and Search Sessions
To characterize and compare entities

WHAT WE FOUND:
Profile matching predicts user’s content preference
Profiles can indicate when not to personalize
Profile features can predict expert content
More at : @jin4ir / cs.umass.edu/~jykim
Optional Slides
Correlation between Site vs. Visitor Profiles

Website reading level vs. visitor diversity

                               Visitor Profile Diversity
                               DivR(U|s)   DivT(U|s)   DivRT(U|s)
Website Reading Level E[R|s]   0.052       0.081       0.095
Breakdown per topic reveals
stronger relationship
[Figure: per-topic breakdown, axis from -0.4 to 0.4, for Kids_and_Teens, Shopping, Home, Games, Adult, Business, Society, Sports, Health, Science, Recreation, Arts, News, Reference, Computers]
Query / User Reading Level against P(Topic)

User profile shows different trends in Computers