Modeling Heterogeneous Networks for Information Ranking, Enrichment and Resolution on Microblogs
Hongzhao Huang (huangh9@rpi.edu)
Advisor: Dr. Heng Ji
Computer Science Department, Rensselaer Polytechnic Institute
April 9, 2015
Doctoral Committee: Dr. Heng Ji (Chair, RPI), Dr. Peter Fox (RPI), Dr. James Hendler (RPI), Dr. Chin-Yew Lin (MSR), Dr. Yizhou Sun (NEU)

Outline
Introduction
o Background
o Overall Problem
o Overview of State-of-the-art Approaches
o Contributions
Contribution I: A HIN-based Ranking Model
o The Model + Evaluation
Contributions II & III: HIN-based Linking and Semantic Relatedness Models
o The Models + Evaluation
Contribution IV: A HIN-based Resolution Model
o The Model + Evaluation
Conclusions and Future Directions
Related Publications

Background
Heterogeneous Information Network (HIN)
o Contains multiple types of objects and relations
Homogeneous Information Network
o Contains a single type of object and relation
[Figure: DBLP bibliographic HIN and co-author homogeneous network]

Modeling HINs is Powerful in Data Mining
Advantages
o Incorporate richer information
o Differentiate multi-typed objects and linked relations

Modeling HINs is Powerful in Data Mining (con't)
In Data Mining
o Ranking: (Deng2009; Sun2009a)
o Clustering: (Sun2009b; Sun2012; Deng2011)
o Similarity and Link Analysis: (Sun2011a; Sun2011b)
o Classification: (Ji2010; Kong2012)
o Most work is based on existing clean and rich HINs (e.g., the DBLP Bibliographic Network)
Leveraging HINs is challenging in NLP
o NLP mainly focuses on more fine-grained units such as words and phrases in unstructured texts
o In most cases, clean and rich HINs do not exist
In this thesis, we aim to explore whether modeling HINs is also powerful in NLP

Microblogging
Some facts about Twitter, followed by an example microblog on Hurricane Irene (2011):
o 288 million monthly active users
o 500 million tweets per day
o A retweeted message reaches 1,000 users on average (Kwak et al., 2010)
o A retweeted message
will be disseminated instantly on the next hops (Kwak et al., 2010)
o Example: "across the street is an evacuation zone, but my side of the street isn't. here's to the hurricane coloring in the lines: http://t.co/uiavHQh #irene"
A unique information resource
o Real-time, diverse, detailed…
o Contains super-fresh information
o A fast information diffusion platform

Characteristic One: Noisy
[Figure: Pear Analytics (2009) report on 2,000 tweets]

Characteristic Two: Short
Maximum of 140 characters in each message
o Information brevity is pervasive
Example: "We must get out of this Slump. We have to stay together. Go Hawks!"

Characteristic Three: Informal and More Implicit Information
Microblog posts are informal and tend to contain more implicit information
o Free usage of language
o Misspellings
o Informal/implicit terms
Examples:
o "Conquer West King" (平西王) = "Bo Xilai" (薄熙来)
o "Baby" (宝宝) = "Wen Jiabao" (温家宝)
We call this phenomenon "Information Morphing"

Characteristic Three: Informal and More Implicit Information (con't)
o "Will King and KD burn out? takes a look at the fatigue factor entering the playoffs."
o "I think the Good Doctor is too crazy to hang it up"
Referents shown on the slide: LeBron James, Ron Paul

Overall Problem
Goal: design effective approaches to enhance natural language understanding in microblogging
Three important sub-problems
o Identify informative information, to address the information noisiness problem
o Enrich information from a knowledge base (e.g., Wikipedia and Freebase) with rich and clean background knowledge, to address the information brevity problem
o Resolve informal and more implicit information, to address the information informality and implicitness problem

Sub-problem 1: Identify Informative Information
Ranking Microblogs based on Informativeness
o After temporal and spatial constraints are applied, a microblog counts as informative if it is useful to a general audience or helpful for event tracking
o Breaking news
o Real-time coverage of ongoing events
o …
Informative Microblog Examples
o "New Yorkers, find your exact evacuation zone by your address here: http://t.co/9NhiGKG /via @user #Irene #hurricane #NY"
o "Details of Aer Lingus flights affected by Hurricane Irene can be found at http://t.co/PCqE74"
Uninformative Microblog Examples
o "Me, Myself, and Hurricane Irene."
o "I'm ready For hurricane Irene."

Sub-problem 2: Information Enrichment from a KB
Wikification for Microblogs
Identify linkable mentions in a microblog and disambiguate them to their referent concepts in a Knowledge Base
o A mention: a phrase referring to a concept in the world
o A concept: a page in a Knowledge Base
Example: "We must get out of this Slump. We have to stay together. Go Hawks!"

Sub-problem 3: Resolve Informal and More Implicit Information
Morph Resolution
Goal: automatically determine which term is used as a morph, and resolve it to its regular referent
Example: "Conquer West King from Chongqing fell from power, do we still need to sing red songs?"

Heterogeneous Networks in Microblogging
[Figure: heterogeneous network linking web documents, microblogs, the social user community, knowledge base concepts, and mentions through semantic relationships]
Leveraging and modeling HINs to enhance natural language understanding in microblogging

State-of-the-art Approaches and Limitations
Link-based ranking or similarity methods based on homogeneous networks (Hsiung et al., 2005; Milne and Witten, 2008; Huang et al., 2011; Mihalcea and Tarau, 2004)
o Ignore discrepancies between multi-typed objects and linked relations
o Ignore cross-genre and cross-type information
Our approach: model HINs to incorporate richer information and capture these discrepancies

State-of-the-art Approaches and Limitations (con't)
Supervised ranking or linking models with multiple levels of features (e.g., content and social features) (Duan et al., 2010; Meij et al., 2012; Guo et al., 2013)
o Require a large amount of training data
o Ignore global evidence from multiple posts
Our approach: model HINs to incorporate global evidence and perform collective inference over both labeled and unlabeled data to save annotation cost

Contributions
A HIN-based ranking model that significantly improves microblog ranking quality
o A new unsupervised propagation model is developed to rank microblogs, web documents, and users simultaneously
A HIN-based linking model that dramatically saves annotation cost and achieves better performance for wikification
o A new collective inference model is designed that incorporates global evidence and leverages a large amount of unlabeled data
A HIN-based semantic relatedness model that significantly enhances both relatedness and disambiguation quality
o A new deep semantic relatedness model that captures latent semantics of concepts is developed
A HIN-based resolution model that substantially outperforms existing alias detection methods
o Directly model unstructured texts with HINs
An uncommon effort to explore heterogeneous networks to
improve NLP approaches

Contribution I: A HIN-based Ranking Model

Hypotheses
Interdependencies exist between multi-typed objects
Hypothesis 1: Informative microblogs are more likely to be posted by authoritative users, and vice versa (authoritative users are more likely to post informative microblogs)
Hypothesis 2: Microblogs involving many users are more likely to be informative
o Similar microblogs appear with high frequency
o Synchronous behavior of users indicates informative information

Hypotheses (con't)
Hypothesis 3: Microblogs aligned with the contents of web documents are more likely to be informative
o "New Yorkers, find your exact evacuation zone by your address here: http://t.co/9NhiGKG /via @user #Irene #hurricane #NY"
o "Details of Aer Lingus flights affected by Hurricane Irene can be found at http://t.co/PCqE74V"
o "Hurricane Irene: City by City Forecasts http://t.co/x1t122A"

Tri-HITS: Ranking Microblogs based on HINs
[Figure: network of web documents, microblogs, and users]
Context similarity (cosine similarity with tf-idf)
Explicit links are sparse
Infer implicit microblog-user relations
o If U1 posts M1 and sim(M1, M2) exceeds a threshold, an edge is created between U1 and M2

Tri-HITS: Preliminaries
Similarity matrix Wdt
Transition matrix Pdt
Initial ranking scores: S0(d), S0(m), S0(u)
Implicit links between microblogs and web documents: Wmd, Wdm
Explicit and implicit links between microblogs and users: Wmu, Wum
[Figure: example heterogeneous network with web documents D1-D2, microblogs M1-M3, users U1-U2, and weighted edges]

Propagation from Microblogs to Web Documents
Tri-HITS: based on the similarity matrix
Co-HITS: based on the transition matrix (Deng et al., 2009)
Differences between Tri-HITS and Co-HITS:
o Tri-HITS normalizes the propagated ranking scores based on the original similarity matrix
o Co-HITS propagates normalized ranking scores using the transition matrix
o Co-HITS weakens or damages the semantic meaning of implicit links in our experimental setting

Tri-HITS (con't)
Propagation from microblogs to users
Propagation from web documents and users to microblogs
o Setting one propagation weight to 0 considers only the microblog-user network
o Setting the other to 0 considers only the web-microblog network

Data and Scoring Metric
Data
o 3,460 microblogs, monitored and posted on different days
o Two annotators independently assigned each microblog a score of 1-5; initial agreement was 66%; disagreements were adjudicated until the difference was <= 1, and the lower grade was taken
Criteria
o Is the microblog likely to be news?
o Does the microblog include information that a general audience would be concerned about during an event?
o The relative informativeness within the data pool

Distribution of Grades
Label   Hour 1   Hour 2   Hour 3
5       65       135      129
4       48       159      102
3       93       255      162
2       119      164      123
1       847      458      602

Evaluation Metric: nDCG
o Combines informativeness and ranking position

Overall Performance (COLING'12)
Evidence from multi-genre networks improves TextRank significantly
Knowledge transferred from the Web and social networks dramatically boosts quality
Modeling heterogeneous networks is effective

Contributions II & III: HIN-based Linking and Semantic Relatedness Models

Collective Wikification based on Semi-supervised Graph Regularization
Relational graph
o Each pair of mention m and concept c is a node
[Figure: relational graph with 0/1 node labels; relations shown: local compatibility, coreference, semantic relatedness]
The model (adapted from Zhu2003)
o yi: the label of node i
o W: weight matrix of the relational graph

Relevant Mention Detection: Meta Path
A meta path is a path defined over a network and composed of a sequence of relations between different object types (Sun et al., 2011)
o Each meta path represents a semantic relation
Meta paths between mention and mention
o M-T-M
o M-T-U-T-M
o M-T-H-T-M
o M-T-U-T-M-T-H-T-M
o M-T-H-T-M-T-U-T-M
Schema of a Heterogeneous Information Network in Twitter
o M: mention, T: tweet, U: user, H: hashtag

Relational Graph Construction
[Figure: relational graph over (mention, concept) nodes with weighted edges, e.g. (gators, Florida Gators men's basketball), (hawks, Atlanta Hawks), (hawks, Hawk), (bucks, Milwaukee Bucks)]
[Figure nodes, continued: (hawks, Hawk), (tonight, Tonight), (days, Day), (now, Now)]
Local Compatibility
o Mention features (e.g., idf, keyphraseness)
o Concept features (e.g., number of incoming/outgoing links)
o Mention + concept features (e.g., prior popularity, tf)
o Context features (e.g., capitalization, tf-idf)

Relational Graph Construction (con't)
[Figure: the relational graph with coreference edges (weight 1.0) added]
Coreference
o At least one meta path exists between two similar mentions

Relational Graph Construction (con't)
[Figure: the relational graph with semantic relatedness edges added]
Semantic Relatedness (SR)
o SR between two mentions: meta path
o SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
Linear combination of these three graphs

A Deep Semantic Relatedness Model (DSRM)
Semantic Knowledge Graphs
[Figure: knowledge graph example. Nodes: Miami Heat, Erik Spoelstra, Miami, 1988, Titanic, Dwyane Wade, Professional Sports Team, National Basketball Association. Relations: Description, Coach, Location, Founded, Roster, Type, Member]

The DSRM Architecture
[Figure: DSRM architecture, applied to a pair of concepts ci and cj]
o Feature vector x: Di (1m), Ci (4m), Ri (3.2k), CTi (1.6k) for concept ci; Dj, Cj, Rj, CTj likewise for concept cj
o Word hashing layer: 105k (50k + 50k + 3.2k + 1.6k)
o Multi-layer nonlinear projections: 300 units per layer
o Semantic layer y
o Semantic relatedness SR(ci, cj): cosine similarity of the two semantic-layer outputs

Data and Scoring Metric
Data
o A Wikipedia dump from May 3, 2013
o A portion of Freebase limited to the Wikipedia concepts
o Wikification: a public data set of 502 messages from 28 users (Meij et al., 2012)
o Semantic relatedness: a benchmark test set
includes 3,314 concepts as test queries (Ceccarelli et al., 2013)
Scoring Metric
o Wikification: standard precision, recall and F1
o Semantic relatedness: nDCG

Models for Comparison
TagMe: an unsupervised model based on prior popularity and semantic relatedness within a single message (Ferragina and Scaiella, 2010)
Meij: the state-of-the-art supervised approach, based on a random forest model (Meij et al., 2012)
SSRegu: our proposed semi-supervised graph regularization model with all three types of relations

Overall Performance (ACL'14)
Meij: uses 100% of the labeled data
SSRegu: uses 50% of the labeled data
7.5% absolute F1 gain over the state-of-the-art supervised model
[Figure: bar chart comparing TagMe, Meij, SSRegu + M&W, and SSRegu + DSRM; values shown: 65.0%, 59.0%, 59.8%, 52.5%, 47.5%, 51.6%, 44.1%, 42.3%, 37.0%, 55.0%, 39.3%, 32.9%]

Quality of Semantic Relatedness (ACL'15 Submission)
[Figure: DSRM compared with the standard relatedness method M&W (Milne and Witten, 2008)]

Semantic Relatedness: Examples
Concept               M&W    DSRM
New York City         0.92   0.22
New York Knicks       0.78   0.79
Washington, D.C.      0.80   0.30
Washington Wizards    0.60   0.85
Atlanta               0.71   0.39
Atlanta Hawks         0.53   0.83
Houston               0.55   0.37
Houston Rockets       0.49   0.80
Semantic relatedness scores between a sample of concepts and the concept "National Basketball Association" in the sports domain.
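
The nDCG metric named in the scoring slides (for microblog ranking and for semantic relatedness) rewards placing highly graded items near the top of a ranking. The sketch below is a minimal illustration, assuming the common exponential gain 2^g - 1 and a log2 position discount; the slides do not state which nDCG variant was used, and the function names `dcg` and `ndcg` are illustrative rather than taken from the thesis.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order (best rank first).

    Assumed variant: gain 2^g - 1, discount log2(position + 1) with 1-based positions.
    """
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg(grades, k=None):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ordering."""
    k = len(grades) if k is None else k
    ideal = dcg(sorted(grades, reverse=True)[:k])
    # If no item has a positive grade, the ranking carries no signal.
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical grades (1-5 scale, as in the microblog annotation) in system order.
    print(ndcg([3, 5, 1, 4, 2], k=3))
```

A ranking that already lists grades in descending order scores 1.0 under this definition; any misordering within the top k lowers the score.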

Impact of Semantic Relatedness on Concept Disambiguation
News dataset: 4,485 mentions (Hoffart et al., 2011)
AIDA: an unsupervised collective inference method (Hoffart et al., 2011)
Our methods are completely unsupervised
[Figure: Tweet set results for TagMe, Meij, SSRegu + M&W, SSRegu + DSRM; News dataset results for AIDA, SSRegu + M&W, SSRegu + DSRM]

Remaining Challenges
Mention detection is the performance bottleneck
Mention disambiguation: city and country names that refer to sports teams (e.g., "Miami" -> "Miami Heat")
o Incorporate user interests
Non-linkable entity mention recognition and clustering
[Figure: error distribution]

Contribution IV: A HIN-based Resolution Model

Target Candidate Identification
Considering all entities would be overwhelming
o It makes resolution difficult and hurts system efficiency
Temporal Distribution Assumption
o Intuition: social users should know the real targets before they use morphs
o Assume the target candidates appear within a certain time period (e.g., 7 days) of the morph
o Naïve, but it narrows the candidates down to 1% while keeping 92% of all targets

Target Candidate Ranking: Motivating Example
o "Conquer West King from Chongqing fell from power, still need to sing red songs?"
o "There is no difference between that guy's plagiarism and Not Thick's gang crackdown."
o "Remember that Not Thick said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship."
Source: Weibo (censored)

o "Bo Xilai: ten thousand letters of accusation have been received during Chongqing gang crackdown."
o "The webpage of "Tianze Economic Study Institute" owned by the liberal party has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power."
o "Bo Xilai gave an explanation about the source of his son, Bo Guagua's tuition."
o "Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs."
Source: Twitter and Chinese News (uncensored)

Heterogeneous Information Network
[Figure: example of a morph-related heterogeneous information network]
Network Schema
o M: Morphs
o E: Entities
o EV: Events
o NP: Non-Entity Noun Phrases
Three types of meta paths:
o M – E – E
o M – EV – E
o M – NP – E
Each meta path provides a unique angle to measure how similar two objects are

Meta Path-based Similarity Measures
Common Neighbors: the number of common neighbors between a morph m and a target e
Path Count: the number of paths between m and e
Pairwise Random Walk
[Figure: paths p1 and p2 connecting m and e through an intermediate object x]
Kullback-Leibler Distance

Integrate Cross Source/Cross Genre Information
Comparison of Weibo and Twitter
o Weibo: Already put in prison, do we still need to serve Not Thick?
o Twitter: ...call Bo Xilai "conquer west king" or "Not Thick"...
o Information from media not under censorship is more explicit
o Integrate information from Twitter to help morph resolution
Integrate information from cross-genre web documents
o Richer and cleaner information
o Existing NLP tools work better

Learning-to-Rank
Logistic regression model to combine different sets of features
Morph               Target       LCS   CN    PRW     Social   …   Label
Conquer West King   Bo Xilai     0     100   0.4     0.6      …   1
Conquer West King   Wang Lijun   1     50    0.3     0.6      …   0
Conquer West King   Obama        0     4     0.001   0.0      …   0

Data and Scoring Metric
Data
o Time frame: 05/01/2012-06/30/2012
o 1,555K Chinese messages from Weibo
o 66K formal web documents from embedded URLs
o 25K Chinese messages from English Twitter for sensitive morphs
o Tested on 107 morph entities in Weibo, 23 of which are sensitive
Scoring Metric: Acc@k = Ck / T
o Ck: the number of morphs correctly resolved at top position k
o T: the total number of morphs in the ground truth

Overall Performance (ACL'13)
[Figure: Acc@k at k = 1, 5, 10, 20 for the homogeneous network-based method (Hsiung et al., 2005) and for our HIN-based approach; values shown: 70.1%, 65.9%, 59.4%, 51.9%, 47.7%, 41.6%, 37.9%, 23.4%]

Remaining Challenges
Need deeper profile understanding (examples are listed after the ambiguity cases)
Morph and non-morph ambiguity
o Unique:
mainly used as morphs (e.g., Governor Bo)
o Common: used as both morphs and non-morphs (e.g., Baby and President)
Examples for deeper profile understanding:
o Capture family relations
o Ensure type consistency
Morph popularity is not correlated with resolution performance
[Figure: morph resolution performance for unique and common morphs]

Conclusions
We designed various HIN-based methods to enhance natural language understanding in microblogging
o Alleviate the information noisiness, brevity, informality and implicitness problems
We demonstrated that modeling HINs is also powerful in various NLP tasks on microblogs
o Combined existing social relations and deep content analysis methods to construct richer and cleaner HINs
o Designed and explored various novel methods to model HINs
o Significantly outperformed various existing NLP methods

Future Directions
Explore and model HINs in other genres of data (e.g., news)
Knowledge transfer from semantic knowledge graphs with deep learning for information extraction
[Figure: knowledge representations, knowledge graphs, texts]

Related Publications
H. Huang, L. Heck, and H. Ji. Leveraging Deep Neural Networks and Knowledge Graphs for Entity Disambiguation. ACL2015 submission (full).
B. Zhang, H. Huang, X. Pan, H. Ji, K. Knight, Z. Wen, Y. Sun, J. Han and B. Yener. 2014. Be Appropriate and Funny: Automatic Entity Morph Encoding. ACL2014. (short) [3 Citations]
H. Huang, Y. Cao, X. Huang, H. Ji, C. Lin. 2014. Collective Tweet Wikification based on Semi-supervised Graph Regularization. ACL2014. (full) [6 Citations]
H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han and H. Li. 2013. Resolving Entity Morphs in Censored Data. ACL2013. (full) [12 Citations]
H. Huang, A. Zubiaga, H. Ji, H. Deng, D. Wang, H. Le, T. Abdelzaher, J. Han, A. Leung, J. Hancock and C. Voss. 2012. Tweet Ranking based on Heterogeneous Networks. COLING2012.
(full) [13 Citations]

Impact of This Thesis
The idea of modeling HINs for NLP has been adopted by some recent work in the NLP community
o Yu et al. (2014) used a framework similar to our microblog ranking model and achieved state-of-the-art slot filling validation performance
o Zhang et al. (2014) modeled HINs with content information to enhance information recommendation
The morph work has inspired several studies of this particular language phenomenon
o Chen et al. (2013) examined the impact of active censorship on language usage in microblogging
o Hiruncharoenvate et al. (2015) designed algorithms to bypass censorship

Thank You! Questions?