Efficient Monitoring, Mining and Analysis of User-generated Content
Ka Cheung “Richard” Sia (UCLA)
kcsia@cs.ucla.edu
Sept 8 2008

Explosion of user-generated content
  Doubling every 5 months – by Technorati

Characteristics of content
  About 97%-98% of daily content is new
    50-word shingles
  62% of weekly content is new on the web (“What’s New on the Web? The Evolution of the Web from a Search Engine Perspective”, by Ntoulas et al., WWW 2004)

Characteristics of content
  Mostly consists of current-event chatter
    Politics
    Technology
    Entertainment
    Sports

The Yahoo! Buzz service

Agenda
  Introduction: growth and characteristics of user-generated content
  Three aspects
    Monitoring: How to deliver fresh content to users
    Aggregation: How to efficiently deliver personalized results to users
    Analysis of tagging data: Making tagging data useful for advertisers

Framework
  Pull model: A central server monitors data source changes and provides digested content to users
  Push model: Data sources notify the server of updates

Overview
  New challenges
    Content is updated more frequently, with recurring patterns
    More time-sensitive requirements
  Modeling of post updates
  Definition of delay
  Strategies for allocation and scheduling

How are updates modeled?
  Homogeneous Poisson model
    $\lambda(t) = \lambda$ at any $t$
  Periodic inhomogeneous Poisson model
    $\lambda(t) = \lambda(t - nT)$, $n = 1, 2, \ldots$

Definition of metrics
  Delay of a data source: sum of elapsed time for every post
    $D(t_i) = \tau_j - t_i$, where $\tau_j$ is the first retrieval time after the post at $t_i$
    $D(O) = \sum_{i=1}^{k} D(t_i)$
  Delay experienced by the aggregator
    $D(A) = \sum_{i=1}^{n} w_i D(O_i)$

Approach
  Resource allocation
    How often to contact data sources?
    O1 is more active than O2; how much more often should we contact O1 than O2?
    $m_i \propto \sqrt{w_i \lambda_i}$
  Retrieval scheduling
    When to contact a data source?
    2 retrievals are allocated for O1; when should these 2 retrievals be scheduled?
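
As a rough illustration of the resource-allocation question on the Approach slide, the sketch below assumes the homogeneous Poisson model with evenly spaced retrievals, so that a source with posting rate λ retrieved m times per day accumulates an expected delay of λ/(2m) per day. It also assumes the allocation rule is m_i proportional to sqrt(w_i λ_i). The function names, the two posting rates, the weights and the budget of 8 retrievals are all made up for the example, and fractional retrieval counts are left unrounded.

```python
import math


def allocate(rates, weights, total):
    """Split `total` retrievals per day in proportion to sqrt(w_i * lambda_i)."""
    shares = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    scale = total / sum(shares)
    return [s * scale for s in shares]


def weighted_delay(rates, weights, retrievals):
    """Expected daily delay sum_i w_i * lambda_i / (2 * m_i).

    Assumes a homogeneous Poisson posting model and evenly spaced retrievals.
    """
    return sum(w * lam / (2.0 * m)
               for lam, w, m in zip(rates, weights, retrievals))


rates = [9.0, 1.0]    # hypothetical posts per day for O1 and O2
weights = [1.0, 1.0]  # hypothetical equal weights
total = 8.0           # hypothetical budget of retrievals per day

sqrt_alloc = allocate(rates, weights, total)
uniform_alloc = [total / len(rates)] * len(rates)
proportional_alloc = [total * lam / sum(rates) for lam in rates]

for name, alloc in [("sqrt", sqrt_alloc),
                    ("uniform", uniform_alloc),
                    ("proportional", proportional_alloc)]:
    print(name, alloc, weighted_delay(rates, weights, alloc))
```

With these made-up numbers O1 posts nine times as often as O2 but receives only three times as many retrievals under the square-root rule, and that split gives a lower weighted delay than either the uniform or the rate-proportional split.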
Single retrieval per period example
  $\lambda(t) = 1$ for $t \in [0, 1]$, $\lambda(t) = 0$ for $t \in [1, 2]$
  Periodicity $T = 2$
  $\tau = 0.5$, expected delay = 0.75
  $\tau = 1$, expected delay = 0.5
  $\tau = 2$, expected delay = 1.5

Multiple retrievals per period
  When $m$ retrievals per period are allocated and scheduled at times $\tau_1, \ldots, \tau_m$, the expected delay is given by:
    $D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t) (\tau_{i+1} - t) \, dt$, where $\tau_{m+1} = T + \tau_1$
  Criteria for optimality
    $\lambda(\tau_j)(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t) \, dt$

Example
  6 retrievals for $\lambda(t) = 2 + 2\sin(2\pi t)$
  Criteria for optimality
    $\lambda(\tau_j)(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t) \, dt$

Experiment
  Data – 10k RSS feeds from syndic8.com, collected during Oct – Dec 2004
  Typical power-law distribution – good for resource allocation

Performance
  CGM03 (“Effective page refresh policies for Web crawlers”, by Cho and Garcia-Molina, ACM TODS 2003)
    Homogeneous Poisson model
    Optimizes for the “age” metric
  Ours – both resource allocation and retrieval scheduling

Size of estimation window
  Resource constraint: 4 retrievals per day per feed on average
  2 weeks seems an appropriate choice

Consistency of posting rate
  90% of the RSS feeds post consistently

Summary
  Resource allocation is aggressive
  Retrieval scheduling optimizes within each individual data source
  Significantly improved freshness of content
  Also considered user browsing patterns
  “Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho, HyunKyu Cho, in IEEE TKDE 2007
  “Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L.
Tseng, in ICWSM 2007

Agenda
  Introduction: growth and characteristics of user-generated content
  Three aspects
    Monitoring: How to deliver fresh content to users
    Aggregation: How to efficiently deliver personalized results to users
    Analysis of tagging data: Making tagging data useful for advertisers

Aggregate query over blogs
  User-generated content in the Blogosphere and Web 2.0 services contains rich information about recent events
  Aggregation of individual user opinions to show current popular trends

Motivation
  Global aggregation (examples from blogpulse.com)
    Recent news gets picked up quickly
      “Dark Knight” in the week of July 18
      “Olympics”-related phrases in the week of August 8
    Potential drawbacks
      What if a user is not interested in entertainment at all?
      Groups of bloggers collaborate to promote advertisement videos
  Personal aggregation
    Users selectively aggregate from different sources
    Efficient strategy to handle a large number of users and sources

From global to personal aggregation
  bloggers
    “Michael Phelps performance in Olympics is awesome...”
    “Finished watching Michael Phelps in Olympics, let me try the WALL-E DVD...”
    “Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!”
    “Um.. it will be good if there is a free show of Dark Knight and WALL-E”
  items (phrases)
    Olympics
    Dark Knight
    Michael Phelps
    Las Vegas
    WALL-E

Matrix formulation
  Endorsement matrix (E) - e.g. the number of times a blogger mentions an object (keywords / links) in his posts.
  Trust matrix (T) - e.g.
how often a user reads from a blog
  Personalized score (TE) – endorsement score weighted by a user’s trust vector

  E      o1   o2   o3
  b1     3    2    0
  b2     0    3    0
  b3     1    0    1
  b4     1    2    3
  Total  5    7    4

  T      b1   b2   b3   b4
  u1     0.8  0.8  0    0
  u2     0.2  0.2  0.6  0.6
  u3     0    0    0.5  0.5

  TE     o1   o2   o3
  u1     2.4  4.0  0.0
  u2     1.8  2.2  2.4
  u3     1.0  1.0  2.0

Baseline implementations
  Endorsement (blog_id, item, score), Trust (user_id, blog_id, score)
  Personal Aggregate Query
    SELECT e.item, SUM(t.score * e.score) AS p_score
    FROM Endorsement e, Trust t
    WHERE e.blog_id = t.blog_id AND t.user_id = <user id>
    GROUP BY e.item
    ORDER BY p_score DESC
    LIMIT 20
  On-the-fly (OTF)
  View

Optimizing the query
  Identify “template” users
    Typical users interested in sports / politics / technology / ...
  Results of template users are pre-computed
  Results of individual users are combined from partially computed results

Using NMF to discover user groups
  Factorize trust matrix
    Decompose T into two sub-matrices W and H
    Non-negative matrix factorization
    W: <individual users : template users> relationship
    H: <template users : blogs> relationship
  User 2’s trust vector is expressed as a linear combination of the trust vectors of template users 1 and 2
  NMF as an approximation of the original trust matrix

Reconstruction of results
  Personalized endorsement scores of template users are pre-computed; results of individual users are computed on request
  (HE) is maintained as sorted lists for all template users
  W * (HE) is the personal aggregation result
    Computed using the Threshold Algorithm (by Fagin et al., PODS 2001)
  Top-K list
    (HE) are sorted lists
    W * (HE) is a weighted linear combination

Partition of trust matrix
  Decomposition is useful when the matrix is dense
  Real-life data is often skewed (by Akshay et al., ICWSM 2007)
  Hybrid method: uses decomposition only when it is effective
  Figure labels:
    2.7M subscription pairs
    Blogs with more subscribers
    Users with >30 subscriptions, feeds with >30 subscribers: 10k feeds, 24k users, ~1M subscription pairs
    Regions: 1. OTF  2. VIEW  3.
NMF
    Users with more subscriptions

Experiments
  Bloglines.com: online RSS reader
  Trust matrix T (1-0 version): subscription profile
    91K users
    487K RSS feeds
  Endorsement matrix E: blog – keyword occurrence
    Feed content collected between Nov 2006 and Jul 2007
    Keywords filtered to nouns with high tf-idf values
  Platform
    Python implementation of the proposed scheme
    MySQL server on Linux with data stored on RAID

How different is personalization?
  Week 2007 Jan 7 – 2007 Jan 13; major event: iPhone announced
  Top items, 2007-01-07 to 2007-01-13
    Global:     sales iphone apple manager iraq management development software business phone
    User 90439: cattle beef iphone chicago iraq bush apple companies prices quarter
    User 90550: brazil iguazu reuters search vegas argentina kibbutz video cathartik google
    User 91017: yorker iraq bush president views avenue dept troops saddam iran
  Personal aggregation results differ from global aggregation

How different is personalization?
  Overlap comparison of global aggregation and personal aggregation
    $L_G$ – global top 20 items
    $L_i$ – individual top 20 items of user $i$
  Personal aggregation results also differ among users
  Overlap degree with global aggregation result: $L_G \cap L_i$
  Pair-wise among users: $L_i \cap L_j$

Approximation accuracy
  Dense region of subscription matrix
    >30 subscribers: 10152 feeds
    >30 subscriptions: 24340 users
  L2 norm comparison
    Rank   SVD     NMF
    80     848.5   856.9
    90     841.6   850.1
    100    835.1   844.6
    110    829.0   837.9
    120    823.2   833.0
  Sparsity of W (23%), H (13%)
  NMF approximation is close to SVD, with a sparseness advantage

Approximation accuracy
  How many items are approximated by NMF in the top 20 list?
  $T_i$ – top 20 items of user $i$ computed by OTF
  $A_i$ – top 20 items of user $i$ computed by NMF
  70% approximation, and more accurate for higher-ranked items
    $|A_i \cap T_i| / |T_i|$
  Correlation with rank

Efficiency of proposed method
  Update cost (for 1 week of data)
    OTF (222K) < NMF (3.2M) < VIEW (23.6M)
  Query response time
    Averaged over the 1000 users with the highest number of subscriptions
    OTF: execute SQL query on MySQL server
    NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server

    Method  avg    std    max     min
    OTF     2.05s  3.60s  84.42s  0.037s
    NMF     0.46s  0.53s  2.84s   0.007s

  Average query response time reduced by 75%; outliers with significant delay eliminated

Summary
  Deliver tailored results to users by personal aggregation
  Proposed a model for personal aggregate queries
  Optimization by NMF & Threshold Algorithm
  Real-life dataset study shows query response time can be reduced significantly with acceptable approximation accuracy
  “Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008
  “Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L.
Tseng, in UM 2007

Agenda
  Introduction: growth and characteristics of user-generated content
  Three aspects
    Monitoring: How to deliver fresh content to users
    Aggregation: How to efficiently deliver personalized results to users
    Analysis of tagging data: Making tagging data useful for advertisers

More than just tag-cloud

LDA of tagging data
  d – bookmark, w – tag (merged all users), z – topics
  Sample of topics
    ID  Topic                     Tags (top 10 by p(w|z))
    6   Environmental Protection  environment energy green ibm item sustainability oil power solar alternative
    27  Politics                  politics government war iraq usa activism bush military political terrorism
    38  Literature                books book literature ebooks free reading poetry ebook publishing writing
    49  Copyrights                law copyright legal drm sony rights remote vnc ethics creativcommons
    55  Health                    health fitness medicine regex medical drugs exercise running diet training
    63  Linux                     linux unix ubuntu debian os sysadmin shell kernel bash livecd
    69  Photography               photography photo photos flickr camera gallery images pictures digital photoblog

Change of entropy
  Tags with increasing popularity in a period
    Correspond to well-established topics where users have common consensus
    Correspond to developing topics where users are willing to explore new pages

Change of topic association
  • “Programmers” at Oct 2005
    – programming, development, code, patterns, dev, coding, algorithms, scheme, software, ...
  • “Programmers” at Jan 2006
    – programming, development, code, lisp, dev, coding, algorithms, scheme, software, cs, ...
    – work, jobs, career, job, shell, sleep, uml, regex, scripting, bash, ...

Specificity of word semantics
  Entropy vs idf metrics

Summary
  Features can be combined to build a classifier for words
    Tag entropy change rate
    KL-divergence of topic distribution
    Entropy of semantics
  Assist advertisers in selecting better keywords for advertisements
  “Exploring Social Annotations for Word Usage Evolution” – work in progress

Thank you!
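
To make two of the features named in the tagging summary more concrete, here is a minimal sketch of Shannon entropy and KL-divergence. It assumes, without confirmation from the slides, that tag entropy is taken over the distribution of pages a tag is applied to, and that the KL-divergence compares a tag's topic distribution p(z | tag) between two periods. All counts and probabilities are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical counts of how often one tag was applied to each of four pages.
concentrated = [97, 1, 1, 1]   # users mostly agree on one page
spread_out = [25, 25, 25, 25]  # users spread the tag over many pages

# Hypothetical topic distributions p(z | tag) for one tag in two periods.
topics_before = [0.5, 0.5]
topics_after = [0.25, 0.75]

print(entropy(concentrated), entropy(spread_out))
print(kl_divergence(topics_before, topics_after))
```

Under these assumptions a low entropy corresponds to consensus among users, and a non-zero divergence signals that the tag's topic association has shifted between the two periods.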
Definition of metrics
  $\tau_j$ – retrieval time
  $\lambda(t)$ – posting rate
  Expected delay
    Homogeneous Poisson model
      $D(O) = \frac{\lambda (\tau_j - \tau_{j-1})^2}{2}$
    Inhomogeneous Poisson model
      $D(O) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t) (\tau_j - t) \, dt$

Resource allocation
  Consider $n$ data sources $O_1, \ldots, O_n$
    $\lambda_i$ – posting rate of $O_i$
    $w_i$ – weight of $O_i$
    $N$ – total number of retrievals per day
    $m_i$ – number of retrievals per day allocated to $O_i$
  Optimal allocation
    $m_i \propto \sqrt{w_i \lambda_i}$

Single retrieval per period
  For a data source with posting rate $\lambda(t)$ and period $T$, the expected delay when retrieved at time $\tau$ is given by:
    $D(\tau) = \int_0^{\tau} \lambda(t) (\tau - t) \, dt + \int_{\tau}^{T} \lambda(t) (T + \tau - t) \, dt$
  Criteria for optimality
    $\lambda(\tau) = \frac{1}{T} \int_0^T \lambda(t) \, dt$ and $\frac{d\lambda(\tau)}{d\tau} < 0$

Blogs becoming inactive
  Detection of abandoned blogs to save resources
  [2] D. R. Cox, “Regression models and life-tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972
  [3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research

More examples

Major posting patterns
  K-means clustering

Threshold algorithm
  Proposed by Fagin et al. [2001]
  Efficient computation of top-K items from multiple lists with a monotone aggregate function
  Figure labels: blogs, user groups, users

Illustration of matrix partition
  Figure labels: users with more subscriptions; feeds with more subscribers; 2 subscribers; 9 subscribers; 2 subscriptions; 8 subscriptions
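
The Threshold Algorithm slide can be illustrated with a small sketch. This is not the implementation used in the experiments. It is a simplified in-memory version that assumes the aggregate is a weighted sum with non-negative weights, that all scores are non-negative, and that an item missing from a list scores 0 there. The function name `threshold_top_k`, the two template-user lists and the weights are hypothetical.

```python
import heapq


def threshold_top_k(lists, weights, k):
    """Top-k items under score(x) = sum_j weights[j] * lists[j].get(x, 0).

    Assumes non-negative weights and scores, so the aggregate is monotone.
    """
    # Sorted access: every list ordered by descending score.
    ranked = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    max_depth = max((len(r) for r in ranked), default=0)
    seen = {}
    for depth in range(max_depth):
        threshold = 0.0
        for j, r in enumerate(ranked):
            if depth >= len(r):
                continue  # exhausted list: unseen items score 0 here
            item, score = r[depth]
            threshold += weights[j] * score
            if item not in seen:
                # Random access: look the item up in every list.
                seen[item] = sum(w * d.get(item, 0.0)
                                 for w, d in zip(weights, lists))
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen item can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical pre-computed score lists for two template users (rows of HE).
template_1 = {"olympics": 4.0, "phelps": 3.0, "wall-e": 1.0}
template_2 = {"dark knight": 5.0, "wall-e": 2.0, "olympics": 1.0}

# Hypothetical row of W: one user's weights over the two template users.
user_weights = [0.75, 0.25]

result = threshold_top_k([template_1, template_2], user_weights, 2)
print(result)
```

The early exit is the point of the algorithm: once k seen items score at least as high as the threshold built from the current positions in each list, nothing further down the lists can displace them.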