Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research


Search and recommendation are about matching.
  Queries, Documents, Websites, Users

Term-space matching is not always a good idea.
  Granularity
  Sparsity
  Efficiency

Can we build representations beyond term vectors?
  Topic Category, Reading Level, Sentiment, Style

What would be their implications for search and recommendation?
  Entities: Queries, Documents, Websites, Users
  Representations: Topic Category, Reading Level, Sentiment, Style


In a Nutshell

WHAT WE DID:
  Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  In order to characterize and compare entities

WHAT WE FOUND:
  Profile matching predicts a user's content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content


Building Reading Level and Topic Profiles

Predicting Reading Level and Topic for a URL
  Reading Level Classifier: based on a language model and other sources
  Topic Classifier: trained using URLs in each Open Directory Project (ODP) category
  Profile: distribution over reading level, topic, or reading level and topic (RLT)
  [Figure: per-document distributions P(R|d1) and P(T|d1)]

Entity Profile Built from Related URLs
  Entities and related URLs
    Websites: content vs. user-viewed URLs
    Users: URLs visited during search sessions
    Queries: top-10 retrieved URLs
  Example: site profile made from URLs visited during search sessions
  [Figure: per-URL distributions P(R|d) and P(T|d) aggregated into the site profile P(R,T|s)]

Entity Profile Built with Related Entities
  Entity and related entities
    User: websites visited
    Website: surfacing queries
    Query: issuing users
  [Figure: users visit websites, queries surface websites, users issue queries]
  Example: site profile made from the profiles of its visitors
  [Figure: visitor profiles P(R,T|u) aggregated into the site profile P(R,T|s)]

Characterizing and Comparing Profiles
  Characterizing an individual entity
    Mean: expectation
    Variance: entropy
  Characterizing a group of entities
    Build a group centroid from its members
    Variance: divergence among members
  Comparing entities and groups
    Difference in mean
    Divergence in profile (distribution)


Characterizing Web Content, User Interests, and Search Behavior

Data Set
  Session log data
    2,281,150 URL visits (1,218,433 SERP clicks)
    Collected from 8,841 users
  Profiles of entities
    4,715 websites with 25+ clicked URLs
    7,613 users with 25+ URL visits
    141,325 unique queries

Reading Level Distribution for Top ODP Categories
  Each topic has a different reading level distribution.

  Category        R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
  Reference       0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
  Health          0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
  Science         0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
  Computers       0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
  Business        0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
  Society         0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
  Adult           0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
  Kids and Teens  0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
  Games           0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
  Recreation      0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
  Arts            0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
  Home            0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
  News            0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
  Shopping        0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
  Sports          0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94

Topic and reading level characterize websites in each category

Profile matching predicts a user's preference over search results
  Metric
    For each clicked website paired with a skipped website ranked above it,
    the % of the user's preferences that profile matching predicts correctly
  Results
    By degree of focus in the user profile: H(R,T|u)
    By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)

  User Group   #Clicks   KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
  ↑ Focused    5,960     59.23%      60.79%      65.27%
               147,195   52.25%      54.20%      54.41%
  ↓ Diverse    197,733   52.75%      53.36%      53.63%

Users' Deviation from Their Own Profiles
  Stretch reading: session-level reading level >> long-term reading level
  Casual reading: session-level reading level << long-term reading level

URL Title Words for Stretch Reading

  Title word            Log ratio
  tests                 2.22
  test                  1.99
  sample                1.94
  digital               1.88
  (tuition) options     1.87
  (financial) aid       1.87
  (medication) effects  1.84
  education             1.77

URL Title Words for Casual Reading

  Title word     Log ratio
  best           -0.42
  football       -0.45
  store          -0.46
  great (deals)  -0.47
  items          -0.52
  new            -0.53
  sale           -0.61
  games          -0.65

Comparing Expert vs. Non-expert URLs
  Expert vs. non-expert URLs taken from [White'09]

Predicting Expert vs. Novice Websites
  Results
    Baseline (predict most likely class): 65.8%
    Classifier accuracy: 82.2%
  Features

  Feature       Correl. with Expertness   Description
  E[R|Q_s]      +0.34                     Expectation of surfacing queries' reading level
  E[R|U_s]      +0.44                     Expectation of visitors' reading level
  Div_RLT(U,s)  -0.56                     Distance of visitors' RLT profile from the site's
  Div_T(U,s)    -0.55                     Distance of visitors' topic profile from the site's


Thank you for your attention!
WHAT WE DID:
  Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  To characterize and compare entities

WHAT WE FOUND:
  Profile matching predicts a user's content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content

More at: @jin4ir / cs.umass.edu/~jykim


Optional Slides

Correlation between Site vs. Visitor Profiles
  Website reading level vs. visitor diversity

                                 Visitor Profile Diversity
                                 Div_R(U|s)   Div_T(U|s)   Div_RT(U|s)
  Website Reading Level E[R|s]   0.052        0.081        0.095

  Breakdown per topic reveals a stronger relationship
  [Figure: per-topic chart, horizontal axis from -0.4 to 0.4; categories: Kids_and_Teens, Shopping, Home, Games, Adult, Business, Society, Sports, Health, Science, Recreation, Arts, News, Reference, Computers]

Query / User Reading Level against P(Topic)
  User profile shows different trends in Computers
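
The profile operations named in the slides (building an entity profile from related URLs, taking the expectation and entropy of a profile, and comparing a user to a website by KL divergence) can be illustrated with a minimal Python sketch. Several choices here are assumptions, not taken from the slides: the uniform averaging in build_profile, the additive smoothing and the KL(user || site) direction in kl_divergence, and every function name. The user and site profiles are hypothetical; only reference_row copies values from the Reference row of the reading level table.

```python
import math
from collections import defaultdict


def build_profile(url_dists):
    """Average per-URL distributions into one entity profile.

    Assumption: uniform averaging; the slides do not state the aggregation rule.
    """
    totals = defaultdict(float)
    for dist in url_dists:
        for key, prob in dist.items():
            totals[key] += prob
    return {key: mass / len(url_dists) for key, mass in totals.items()}


def expected_reading_level(profile):
    """Mean of a reading-level profile {level: probability}, renormalized."""
    mass = sum(profile.values())
    return sum(level * prob for level, prob in profile.items()) / mass


def entropy(profile):
    """Shannon entropy in bits; lower values mean a more focused profile."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) in bits over the union of keys.

    Assumption: additive smoothing with eps so that keys missing from one
    profile do not produce a division by zero.
    """
    keys = set(p) | set(q)

    def smooth(dist):
        raw = {k: dist.get(k, 0.0) + eps for k in keys}
        z = sum(raw.values())
        return {k: v / z for k, v in raw.items()}

    ps, qs = smooth(p), smooth(q)
    return sum(ps[k] * math.log2(ps[k] / qs[k]) for k in keys)


def predicts_click(user, clicked_site, skipped_site):
    """True if the clicked site's profile is closer to the user's profile."""
    return kl_divergence(user, clicked_site) < kl_divergence(user, skipped_site)


# Reference row of the reading level table (levels R4..R12; R1..R3 are 0.00).
reference_row = {4: 0.02, 5: 0.17, 6: 0.10, 7: 0.15, 8: 0.04,
                 9: 0.02, 10: 0.03, 11: 0.20, 12: 0.27}
mean_rl = expected_reading_level(reference_row)

# Hypothetical topic profiles for one user and two websites.
user = {"Sports": 0.8, "News": 0.2}
site_a = {"Sports": 0.7, "News": 0.3}
site_b = {"Science": 0.9, "News": 0.1}
match = predicts_click(user, site_a, site_b)
```

On the Reference row the expectation comes out at roughly 8.82, close to the first E[R|T] value (8.80) in the slides; the small gap is plausibly due to rounding in the displayed probabilities. In the toy comparison, site_a is the closer match to the user, so match is True.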