WWW Conference Paper Review
Jonathan
Artificial Intelligence Lab
University of Arizona
2015/4/8

Outline
• Overview of WWW
• Paper Review
• Summary

Overview
• Annual Conference
  – 2008 Beijing, China
  – 2009 Madrid, Spain
  – 2010 Raleigh, US
  – 2011 Hyderabad, India
• Submission Track
  – Research Papers
  – Poster
  – Other
    • Demo Proposal, Workshop Proposal, etc.

Areas and Topics
• Data Mining and Machine Learning
  – Deriving actionable insight from Web information sources: query logs, Web graph, click trails, text documents, etc.
• Social Networks
  – Models, algorithms, systems and issues around social networks and collaborative environments.
• Internet Monetization
  – Markets, auctions, games, pricing, advertising, and other Web-specific economic activities.

Areas and Topics (cont.)
• Security and Privacy
• Semantic Web
• Search
• Bridging Structured and Unstructured Data
• Software Architecture and Infrastructure
• Performance, Scalability and Availability
• Networking and Mobility
• User Interfaces and Rich Interaction
• Rich Media
• Web Services and Service-Oriented Computing

Major Groups in Data Mining
• Academia
  – University of Illinois at Urbana-Champaign (11)
  – Stanford University (4)
  – Cornell University (3)
  – Arizona State University (3)
• Industry
  – Google (8)
  – Microsoft (7)
  – Yahoo! (5)
  – IBM (2)

Best Papers
• 2008
  – IRLbot: Scaling to 6 Billion Pages and Beyond. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov (Texas A&M University)
• 2009
  – Hybrid Keyword Search Auctions. Ashish Goel (Stanford University), Kamesh Munagala (Duke University)
• 2010
  – Factorizing Personalized Markov Chains for Next-Basket Recommendation. Steffen Rendle (Osaka University), Christoph Freudenthaler and Lars Schmidt-Thieme (University of Hildesheim)

Trends
• More and more research innovates by combining and integrating different approaches to the same problem.
• In the data mining field, an increasing number of studies combine text mining and social network analysis.

Factorizing Personalized Markov Chains for Next-Basket Recommendation
S. Rendle (Osaka University, Japan)
C. Freudenthaler (University of Hildesheim, Germany)
L. Schmidt-Thieme (University of Hildesheim, Germany)

Overview
• 2010 Best Paper
• Research Question: Can we provide better product recommendations for different users?
  – Matrix Factorization (MF)
  – Markov Chain (MC)
• Methodology: Factorized Personalized Markov Chain model (FPMC)
• Conclusion: The proposed method outperforms the MF model and the non-personalized MC model.
[Figure: non-personalized MC vs. FPMC]

FPMC
• Task: estimate each ? (unknown entry) in the transition cube.
  – But there are too many ?s and too little data for each user.
• Solution: FPMC
  – Factorize the cube, so that each user u's transition probability from item i to item j is influenced by the other transitions made by the same user, from the same item i, and to the same item j.
[Figure: FPMC]

Take-away
• May be helpful when modeling sequential data that can be grouped (personalized).
• Potential application in AI Lab
  – Combine authorship analysis and sequential text mining.
  – Predict the next word/sentence/paragraph of a particular author.

Topic Modeling with Network Regularization
Q. Mei (University of Illinois at Urbana-Champaign)
D. Cai (University of Illinois at Urbana-Champaign)
D. Zhang (University of Illinois at Urbana-Champaign)
C. Zhai (University of Illinois at Urbana-Champaign)

Overview
• Research Question: Can we improve topic modeling by incorporating knowledge of the network structure?
• Methodology: Topic Modeling with Network Structure (TMN). Network Probabilistic Latent Semantic Analysis (NetPLSA) is used as an example.
• Conclusion: The proposed approach outperforms both purely text-oriented and network-oriented methods.
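
To make the idea of network regularization concrete, here is a minimal sketch of an objective that trades off topic-model fit against topic smoothness across connected nodes. It is an illustration only, not the authors' implementation: the function name `regularized_objective`, the trade-off parameter `lam`, the squared-difference penalty, and the toy data are all assumptions made for this example.

```python
def regularized_objective(log_likelihood, topic_dist, edges, lam=0.5):
    """Return a cost (lower is better) that balances fit and graph smoothness.

    log_likelihood: log-likelihood of the text under the topic model.
    topic_dist: maps a node id to its topic distribution (list of probabilities).
    edges: iterable of (u, v, weight) tuples from the network.
    lam: assumed trade-off weight in [0, 1]; lam=0 ignores the network.
    """
    penalty = 0.0
    for u, v, weight in edges:
        # Squared difference between the topic distributions of adjacent nodes.
        diff = sum((pu - pv) ** 2 for pu, pv in zip(topic_dist[u], topic_dist[v]))
        penalty += weight * diff
    return -(1.0 - lam) * log_likelihood + (lam / 2.0) * penalty


# Toy data (hypothetical): nodes a and b discuss similar topics, c does not.
topic_dist = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9]}
smooth = regularized_objective(-10.0, topic_dist, [("a", "b", 1.0)])
rough = regularized_objective(-10.0, topic_dist, [("a", "c", 1.0)])
```

With the same text likelihood, linking two nodes whose topics agree (`smooth`) costs less than linking two nodes whose topics differ (`rough`), which is the pressure that pulls neighboring nodes toward similar topic distributions.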
Topic Modeling with Network Structure
• In general, TMN is a framework for combining an arbitrary topic model with network constraints.
  – It builds an objective function that balances maximizing the likelihood of the generated topic model against minimizing the differences between the topic distributions of adjacent nodes in the network graph.
[Figure: Geographic topic distribution for Hurricane Katrina]

Take-away
• When we deal with text data that has a network structure attached, the TMN framework may be helpful.
• Potential application in AI Lab
  – Geopolitical topic modeling.
  – Incorporate the reply network into forum topic modeling.

Exploiting Social Context for Review Quality Prediction
Y. Lu (University of Illinois at Urbana-Champaign)
P. Tsaparas (Microsoft)
A. Ntoulas (Microsoft)
L. Polanyi (Microsoft)

Overview
• Research Question: Can we improve review quality prediction by incorporating social context alongside text features?
• Methodology: Linear regression with regularization constraints.
• Conclusion: Prediction accuracy is greatly increased.

Regression with Social Context Constraints
• Besides textual features, the regression model uses social context features as regularization constraints.
  – Author consistency
  – Trust consistency
  – Co-citation consistency
• Minimize both the mean squared error and the violations of the above three consistency conditions.

Take-away
• A good example of utilizing text data with a network structure attached.
• When we want to assign numerical scores to textual data, we can use relationships to adjust these scores.
• Potential application in AI Lab
  – Sentiment analysis.

Topic Initiator Detection on the World Wide Web
X. Jin (University of Illinois at Urbana-Champaign)
S. Spangler (IBM)
R. Ma (IBM)
J.
Han (University of Illinois at Urbana-Champaign)

Overview
• Research Question: How can we find the initiator of a topic in online media?
• Methodology: InitRank
• Conclusion: The proposed method outperforms baseline models such as sorting the documents by time.

InitRank: TCL Graph
• After initiator indicator attributes are extracted for all documents on a topic, a TCL graph is constructed.
  – TCL = Time + Content + Link
• Two kinds of relationships exist between document nodes in the graph.
  – Link (solid)
    • Points to the referenced document.
  – Document similarity (dashed)
    • Points to the earlier document.
• Initiator values for nodes are initialized from other attributes such as centrality, novelty, originality, and document length.
• These values are then optimized on the graph.

Take-away
• Again, a good example of combining text mining and social network analysis.
• Potential application in AI Lab
  – This framework may be useful in modeling the "paths" in information diffusion.

AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads
H. Bao (Google)
E. Chang (Google)

Overview
• Research Question: In a social network, is targeting ads to a user based on other users' influence better than targeting based on this user's own features?
  – Empirically, a user who is an expert in one area shows no interest in ads in that area.
  – This research therefore attempts to target ads based on information from the other users who influence the target user most.
• Methodology: Heat diffusion model
• Conclusion: The influence-based model outperforms the traditional model in terms of click-through rate (CTR).

AdHeat Model
• 1) Social Network Construction
  – a. Edge weights are calculated based on relationship attributes.
  – b. An influence score for each user is calculated based on HITS (Hypertext Induced Topic Selection).
• 2) Hint-word Generation: LDA (Latent Dirichlet Allocation)
• 3) Influence Propagation: Heat Diffusion Equation
[Figure: example influence network over users u1 to u4 with weighted edges, and each user's hint words:
  u1: music, 0.4; guitar, 0.6
  u2: movie, 0.14; art, 0.46; music, 0.2; guitar, 0.2
  u3: basketball, 0.6; movie, 0.15; art, 0.25
  u4: concert, 0.1; cooking, 0.12; music, 0.21; guitar, 0.25; movie, 0.14; art, 0.02; basketball, 0.16]

Take-away
• Attributes for an instance (user) can also be modeled indirectly from other nodes by looking at their relationships.
• Potential application in AI Lab
  – When clustering stakeholder groups, besides writing style, we can also look at which topics an author reads most to help identify the author's group.

Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums
J. Yang (Microsoft)
R. Cai (Microsoft)
Y. Wang (Chinese Academy of Sciences)
J. Zhu (Tsinghua University)
L. Zhang (Microsoft)
W. Ma (Microsoft)

Overview
• Research Question: Can we create a general forum crawler that extracts structured data from any forum?
• Methodology: Markov Logic Networks (MLN)
• Conclusion: The proposed mechanism is shown to be quite promising.

Markov Logic Networks
• A probabilistic extension of first-order logic.
  – A Markov logic network contains multiple assertions called formulas, each of which is assigned a weight.
  – An instance does not have to satisfy all the formulas to confirm the final assertion.
  – This "fuzziness" handles the differences among forum designs and contributes to the generalizability of the forum crawler.
[Figure: example of detecting a thread title, where h is an HTML element]

Take-away
• A promising framework to facilitate spidering and parsing in the future.
• MLN may be useful when you need to enhance the compatibility of a system.
• Potential application in AI Lab
  – Employ MLN to process textual information intelligently.
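
The weighted-formula idea behind MLN can be sketched in a few lines. The following is a toy illustration under stated assumptions, not the crawler from the paper: the feature names (`is_link`, `in_table_row`, `is_long_text`), the three formulas, and their weights are hypothetical, and the probability is normalized over only the two possible labels of a single element rather than over a full network of ground atoms.

```python
import math


def implies(evidence_key):
    """Build the formula 'evidence(h) => IsTitle(h)' as a satisfaction test."""
    return lambda h, is_title: (not h[evidence_key]) or is_title


def implies_not(evidence_key):
    """Build the formula 'evidence(h) => not IsTitle(h)'."""
    return lambda h, is_title: (not h[evidence_key]) or (not is_title)


# Hypothetical formulas and weights, chosen only for illustration.
FORMULAS = [
    (1.5, implies("is_link")),
    (2.0, implies("in_table_row")),
    (1.0, implies_not("is_long_text")),
]


def title_probability(h, formulas=FORMULAS):
    """P(IsTitle(h)): exp(sum of satisfied weights), normalized over both labels."""
    def potential(is_title):
        return math.exp(sum(w for w, f in formulas if f(h, is_title)))

    pos, neg = potential(True), potential(False)
    return pos / (pos + neg)


# h is an HTML element described by boolean features (hypothetical).
typical = {"is_link": True, "in_table_row": True, "is_long_text": False}
unusual = {"is_link": True, "in_table_row": False, "is_long_text": True}
p_typical = title_probability(typical)
p_unusual = title_probability(unusual)
```

The `unusual` element violates one formula when labeled as a title, yet it still scores above 0.5 because the remaining evidence outweighs the violation. This is the soft behavior that lets one rule set tolerate different forum layouts.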
Summary
Situation | Potentially useful model
Sequential data that can be grouped | FPMC
Text data with a clear network structure attached | TMN or regularized regression
Modeling information diffusion in a noisy dataset | TCL
The features that can be extracted from a class of entities are too limited | AdHeat
Need for flexibility in decision making or for system compatibility | MLN

References
• H. Bao, E. Y. Chang. 2010. AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads. In Proceedings of the 19th International Conference on World Wide Web.
• S. Goel, R. Muhamad, D. Watts. 2009. Social Search in "Small-World" Experiments. In Proceedings of the 18th International Conference on World Wide Web.
• X. Jin, S. Spangler, R. Ma, J. Han. 2010. Topic Initiator Detection on the World Wide Web. In Proceedings of the 19th International Conference on World Wide Web.
• Y. Lu, P. Tsaparas, A. Ntoulas, L. Polanyi. 2010. Exploiting Social Context for Review Quality Prediction. In Proceedings of the 19th International Conference on World Wide Web.
• S. Rendle, C. Freudenthaler, L. Schmidt-Thieme. 2010. Factorizing Personalized Markov Chains for Next-Basket Recommendation. In Proceedings of the 19th International Conference on World Wide Web.
• H. Lee, D. Leonard, X. Wang, D. Loguinov. 2008. IRLbot: Scaling to 6 Billion Pages and Beyond. In Proceedings of the 17th International Conference on World Wide Web.
• A. Goel, K. Munagala. 2009. Hybrid Keyword Search Auctions. In Proceedings of the 18th International Conference on World Wide Web.
• J. Yang, R. Cai, Y. Wang, J. Zhu, L. Zhang, W. Ma. 2009. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums.
In Proceedings of the 18th International Conference on World Wide Web.

Social Search in "Small-World" Experiments
S. Goel (Yahoo!)
R. Muhamad (Columbia University)
D. Watts (Yahoo!)

Overview
• 2009 Best Paper Nominee
• Research Question: Are individuals able to find the theoretically shortest path connecting them to anyone else in the social network?
  – Every pair of individuals is connected by about 6 intermediaries.
    • Topological distance
    • Search distance
• Methodology: Message-forwarding experiment; Logistic Multilevel Regression
• Conclusion: The mean chain length in the algorithmic sense is much larger than 6.

Attrition in Connectivity
• Attrition Rate
  – The probability that message forwarding stops at a given node.
• Motivation: estimate the real algorithmic chain length
  – Chain length cannot be obtained directly because, in the experiment, more than 99% of messages fail to reach their final recipients because of "attrition".
• The attrition rate can be affected by network topology and individual differences.
  – People with high social status (educated, wealthy, etc.) tend to have a lower attrition rate.

Take-away
• When modeling directed social relationships, we may take individual differences into account.
• Potential application in AI Lab
  – Consider attrition in the opinion diffusion model.
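
The effect of attrition on observed chain lengths can be illustrated with a small sketch. It rests on a simplifying assumption that is not from the paper: every forwarding step fails with the same constant probability `attrition_rate`, so a chain that needs L steps completes with probability (1 - attrition_rate)^L. The function names and the toy counts below are hypothetical.

```python
def correct_for_attrition(observed_counts, attrition_rate):
    """Re-weight completed-chain counts by the inverse chance of surviving.

    observed_counts: maps chain length (number of forwarding steps) to the
    number of completed chains of that length.
    Returns the estimated length distribution had no chain been abandoned.
    """
    survive = 1.0 - attrition_rate
    weights = {
        length: count / survive ** length
        for length, count in observed_counts.items()
    }
    total = sum(weights.values())
    return {length: w / total for length, w in weights.items()}


def mean_length(distribution):
    """Mean chain length of a normalized length distribution."""
    return sum(length * p for length, p in distribution.items())


# Toy counts (hypothetical): completed chains are dominated by short ones.
observed = {2: 50, 4: 30, 8: 20}
total_observed = sum(observed.values())
observed_mean = mean_length({k: v / total_observed for k, v in observed.items()})
corrected_mean = mean_length(correct_for_attrition(observed, attrition_rate=0.5))
```

Because long chains are far more likely to be abandoned, the completed chains under-represent them, and the corrected mean comes out larger than the observed mean. The study models attrition as varying with topology and individual differences, so a constant rate is only a first approximation.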