Overview - Rensselaer Polytechnic Institute

Modeling Heterogeneous Networks for Information
Ranking, Enrichment and Resolution on Microblogs
Hongzhao Huang
huangh9@rpi.edu
Advisor: Dr. Heng Ji
Computer Science Department
Rensselaer Polytechnic Institute
April 9, 2015
Doctoral Committee:
Dr. Heng Ji (Chair, RPI)
Dr. Peter Fox (RPI)
Dr. James Hendler (RPI)
Dr. Chin-Yew Lin (MSR)
Dr. Yizhou Sun (NEU)
Outline
Introduction
o Background
o Overall Problem
o Overview of State-of-the-art Approaches
o Contributions
Contribution I: A HIN-based Ranking Model
o The Model + Evaluation
Contributions II & III: HIN-based Linking and Semantic Relatedness Models
o The Models + Evaluation
Contribution IV: A HIN-based Resolution Model
o The Model + Evaluation
Conclusions and Future Directions
Related Publications
Background
Heterogeneous Information Network (HIN)
o Contains multiple types of objects and relations
o Example: DBLP bibliographic HIN
Homogeneous Information Network
o Contains a single type of object and relation
o Example: co-author homogeneous network
Modeling HINs is Powerful in Data Mining
Advantages
o Incorporate richer information
o Differentiate multi-typed objects and linked relations
[Figure: DBLP bibliographic HIN vs. co-author homogeneous network]
Modeling HINs is Powerful in Data Mining
In Data Mining
o Ranking: (Deng2009; Sun2009a)
o Clustering: (Sun2009b; Sun2012; Deng2011)
o Similarity and Link Analysis: (Sun2011a; Sun2011b)
o Classification: (Ji2010; Kong2012)
o Most work builds on existing clean and rich HINs (e.g., the DBLP Bibliographic Network)
Leveraging HINs is challenging in NLP
o NLP mainly focuses on more fine-grained units such as words and phrases in unstructured texts
o In most cases, clean and rich HINs do not exist
In this thesis, we aim to explore whether modeling HINs is also powerful in NLP
Microblogging
Some facts about Twitter:
o 288 million monthly active users
o 500 million tweets per day
o A retweeted message reaches 1,000 users on average (Kwak et al., 2010)
o A retweeted message is disseminated almost instantly over the next hops (Kwak et al., 2010)
An example microblog on Hurricane Irene 2011
o across the street is an evacuation zone, but my side of the street isn't. here's to the hurricane coloring in the lines: http://t.co/uiavHQh #irene
A unique information resource
o Real-time, diverse, detailed…
o Contains super-fresh information
o A fast information diffusion platform
Characteristic One: Noisy
Source: Pear Analytics (2009), a study of 2,000 tweets
Characteristic Two: Short
Maximum of 140 characters per message
o Information brevity is pervasive
Example: We must get out of this Slump. We have to stay together. Go Hawks!
Characteristic Three: Informal and More Implicit Information
Microblog posts are informal and tend to contain more implicit information
o Free usage of language
  - Misspellings
  - Informal/implicit terms
Examples
o “Conquer West King” (平西王) = “Bo Xilai” (薄熙来)
o “Baby” (宝宝) = “Wen Jiabao” (温家宝)
We call this phenomenon “Information Morphing”
Characteristic Three: Informal and More Implicit Information
o “Will King and KD burn out? takes a look at the fatigue factor entering the playoffs.” (King = LeBron James)
o “I think the Good Doctor is too crazy to hang it up” (Good Doctor = Ron Paul)
Overall Problem
Goal: design effective approaches to enhance natural language understanding in microblogging
Three important sub-problems
o Identify informative information
  - To solve the information noisiness problem
o Enrich information from a knowledge base (e.g., Wikipedia and Freebase) with rich and clean background knowledge
  - To solve the information brevity problem
o Resolve informal and more implicit information
  - To solve the information informality and implicitness problem
Sub-problem 1: Identify Informative Information
Ranking Microblogs based on Informativeness
After applying temporal and spatial constraints, a microblog is informative if it is useful to a general audience or helpful for event tracking
o Breaking news
o Real-time coverage of ongoing events
o …
Informative Microblog Examples
o New Yorkers, find your exact evacuation zone by your address here: http://t.co/9NhiGKG /via @user #Irene #hurricane #NY
o Details of Aer Lingus flights affected by Hurricane Irene can be found at http://t.co/PCqE74”
Uninformative Microblog Examples
o Me, Myself, and Hurricane Irene.
o I'm ready For hurricane Irene.
Sub-problem 2: Information Enrichment from a KB
Wikification for Microblogs
Identify linkable mentions in a microblog and disambiguate them to their referent concepts in a Knowledge Base
o A mention: a phrase referring to a concept in the world
o A concept: a page in a Knowledge Base
Example: We must get out of this Slump. We have to stay together. Go Hawks!
Sub-problem 3: Resolve Informal and More Implicit Information
Morph Resolution
Goal: automatically determine which term is used as a morph, and resolve it to its regular referent
Example: Conquer West King from Chongqing fell from power, do we still need to sing red songs?
Heterogeneous Networks in Microblogging
[Figure: web documents, microblogs, the social user community, concept mentions and a knowledge base, connected by semantic relationships]
Leveraging and modeling HINs to enhance natural language understanding in microblogging
State-of-the-art Approaches and Limitations
Link-based ranking or similarity methods based on homogeneous networks (Hsiung et al., 2005; Milne and Witten, 2008; Huang et al., 2011; Mihalcea and Tarau, 2004)
o Ignore discrepancies between multi-typed objects and linked relations
o Ignore cross-genre and cross-type information
Modeling HINs incorporates richer information and captures these discrepancies
State-of-the-art Approaches and Limitations
Supervised ranking or linking models with multiple levels of features (e.g., content and social features) (Duan et al., 2010; Meij et al., 2012; Guo et al., 2013)
o Require a large amount of training data
o Ignore global evidence from multiple posts
Modeling HINs incorporates global evidence and performs collective inference over both labeled and unlabeled data to save annotation cost
Contributions
A HIN-based ranking model that significantly improves microblog ranking quality
o A new unsupervised propagation model is developed to rank microblogs, web documents, and users simultaneously
A HIN-based linking model that dramatically saves annotation cost and achieves better performance for wikification
o A new collective inference model is designed that incorporates global evidence and leverages a large amount of unlabeled data
A HIN-based semantic relatedness model that significantly enhances both relatedness and disambiguation quality
o A new deep semantic relatedness model that captures latent semantics of concepts is developed
A HIN-based resolution model that substantially outperforms existing alias detection methods
o Directly model unstructured texts with HINs
An uncommon effort to explore heterogeneous networks to improve NLP approaches
Contribution I: A HIN-based Ranking Model
Hypotheses



Interdependencies exist between multi-typed objects
Hypothesis 1: Informative microblogs are more likely to
be posted by authoritative users; and vice versa
(authoritative users are more likely to post informative
microblogs)
Hypothesis 2: Microblogs involving many users are
more likely to be informative
o Similar microblogs appear with high frequency
o Synchronous behavior of users indicates informative
information
Hypotheses
Hypothesis 3: Microblogs aligned with contents of web documents are more likely to be informative
o New Yorkers, find your exact evacuation zone by your address here: http://t.co/9NhiGKG /via @user #Irene #hurricane #NY
o Details of Aer Lingus flights affected by Hurricane Irene can be found at http://t.co/PCqE74V”
o Hurricane Irene: City by City Forecasts http://t.co/x1t122A
Tri-HITS: Ranking Microblogs based on HINs
[Figure: web documents, microblogs and users; links are weighted by context similarity (cosine similarity over tf-idf)]
Explicit links are sparse
Infer implicit microblog-user relations
o If U1 posts M1 and sim(M1, M2) exceeds a threshold, an edge is created between U1 and M2
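The implicit-link rule above can be illustrated with a small sketch. It is not the thesis implementation: the whitespace tokenizer, the smoothed idf formula, the threshold of 0.5 and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption); the slide only says "tf-idf".
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def microblog_user_edges(posts, threshold=0.5):
    """posts: list of (user, text). Returns a set of (user, microblog index) edges."""
    vecs = tfidf_vectors([text.lower().split() for _, text in posts])
    edges = {(user, i) for i, (user, _) in enumerate(posts)}  # explicit links
    for i, (user, _) in enumerate(posts):
        for j in range(len(posts)):
            if i != j and cosine(vecs[i], vecs[j]) > threshold:
                edges.add((user, j))  # implicit link: the user's post resembles post j
    return edges

posts = [
    ("U1", "hurricane irene evacuation zone map"),
    ("U2", "irene hurricane evacuation zone map update"),
    ("U3", "my cat is sleeping"),
]
edges = microblog_user_edges(posts)
```

In this toy input, U1 gains an implicit edge to U2's microblog (and vice versa) because the two posts are near-duplicates, while U3 stays connected only to its own post.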
Tri-HITS: preliminaries
Heterogeneous networks
o Similarity matrix Wdt
o Transition matrix Pdt
o Initial ranking scores: S0(d), S0(m), S0(u)
Implicit links between microblogs and web documents: Wmd, Wdm
Explicit and implicit links between microblogs and users: Wmu, Wum
[Figure: example network over microblogs M1-M3, web documents D1-D2 and users U1-U2 with weighted edges]
Propagation from Microblogs to Web Documents
Tri-HITS: based on the similarity matrix
Co-HITS: based on the transition matrix (Deng et al., 2009)
Differences between Tri-HITS and Co-HITS:
o Tri-HITS normalizes the propagated ranking scores based on the original similarity matrix
o Co-HITS propagates normalized ranking scores using the transition matrix
o Co-HITS weakens or damages the semantic meaning of implicit links in our experimental setting
Tri-HITS (con’t)
Propagation from microblogs to users
Propagation from web documents and users to microblogs
o Setting the weight of the web-document term to 0 considers only microblog-user networks
o Setting the weight of the user term to 0 considers only web-microblog networks
Data and Scoring Metric
Data
o Monitored 3,460 microblogs posted on different days
o Two annotators independently assigned each microblog a grade from 1 to 5; initial agreement was 66%; disagreements were adjudicated until the grades differed by at most 1, and the lower grade was taken
o Criteria
  - Is the microblog likely to be news?
  - Does the microblog include information that a general audience will be concerned about during an event?
  - The relative informativeness in the data pool
Distribution of Grades
Label    5    4    3    2    1
Hour 1   65   48   93   119  847
Hour 2   135  159  255  164  458
Hour 3   129  102  162  123  602
Evaluation Metric: nDCG
o Combines informativeness and ranking position
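As a reminder of how nDCG combines grade and position, here is a minimal sketch. It assumes the common exponential-gain form of DCG with a log2 position discount; the slides do not say which nDCG variant was used, and the function names are illustrative.

```python
import math

def dcg(grades):
    """Discounted cumulative gain of grades listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))

def ndcg(ranked_grades, k=None):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    k = len(ranked_grades) if k is None else k
    ideal = sorted(ranked_grades, reverse=True)
    best = dcg(ideal[:k])
    return dcg(ranked_grades[:k]) / best if best > 0 else 0.0

perfect = ndcg([5, 4, 3, 2, 1])   # grades already in the ideal order
reversed_ = ndcg([1, 2, 3, 4, 5]) # most informative microblog ranked last
```

A ranking that places grade-5 microblogs first scores 1.0; pushing them down the list lowers the score because of the position discount.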
Overall Performance (COLING’12)
o Evidence from multi-genre networks improves TextRank significantly
o Knowledge transferred from the Web and social networks dramatically boosted quality
o Modeling heterogeneous networks is effective
Contribution II & III: HIN-based Linking
and Semantic Relatedness Models
Collective Wikification based on Semi-supervised Graph Regularization
Relational graph
o Each pair of mention m and concept c is a node
o Three types of relations between nodes: Local Compatibility, Coreference, Semantic Relatedness
The model (adapted from Zhu2003)
o yi: the label of node i
o W: weight matrix of the relational graph
[Figure: example relational graph with node labels 0 and 1]
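To make the graph regularization idea concrete, the sketch below implements the basic harmonic-function label propagation that Zhu2003-style models build on: labeled nodes are clamped and every unlabeled node repeatedly takes the weighted average of its neighbors. This is the generic technique, not the adapted objective used in the thesis; the function name, the 0.5 initial value and the iteration count are assumptions.

```python
def propagate_labels(W, labels, iterations=200):
    """Harmonic label propagation on a weighted graph.

    W: symmetric weight matrix (list of lists) with a zero diagonal.
    labels: dict mapping labeled node index -> 0 or 1.
    Returns a list of scores in [0, 1], one per node.
    """
    n = len(W)
    y = [float(labels.get(i, 0.5)) for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            if i in labels:
                continue  # labeled nodes stay clamped to their gold label
            degree = sum(W[i])
            if degree > 0:
                # Unlabeled node: weighted average of neighboring scores.
                y[i] = sum(W[i][j] * y[j] for j in range(n)) / degree
    return y

# Chain 0 - 1 - 2 - 3 with node 0 labeled 1 (correct link) and node 3 labeled 0.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
scores = propagate_labels(W, {0: 1, 3: 0})
```

On this chain the unlabeled nodes converge to 2/3 and 1/3, so each <mention, concept> node inherits a score from the labeled nodes it is most strongly connected to.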
Relevant Mention Detection: Meta Path
A meta-path is a path defined over a network and composed of a sequence of relations between different object types (Sun et al., 2011)
o Each meta path represents a semantic relation
Meta paths between mention and mention
o M-T-M
o M-T-U-T-M
o M-T-H-T-M
o M-T-U-T-M-T-H-T-M
o M-T-H-T-M-T-U-T-M
Schema of a Heterogeneous Information Network in Twitter
o M: mention, T: tweet, U: user, H: hashtag
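A meta path can be followed mechanically by walking the network one object type at a time. The sketch below is a toy illustration under assumed data structures (an undirected adjacency map and a node-to-type dictionary); the node names and helper functions are hypothetical.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def follow_meta_path(graph, node_types, start, meta_path):
    """Return the nodes reachable from `start` along the type sequence `meta_path`."""
    if node_types[start] != meta_path[0]:
        return set()
    frontier = {start}
    for wanted in meta_path[1:]:
        frontier = {nb for node in frontier for nb in graph[node]
                    if node_types[nb] == wanted}
    return frontier

node_types = {"m1": "M", "m2": "M", "m3": "M",
              "t1": "T", "t2": "T", "t3": "T",
              "u1": "U", "u2": "U"}
graph = build_graph([
    ("m1", "t1"), ("m2", "t2"), ("m3", "t3"),  # mention appears in tweet
    ("t1", "u1"), ("t2", "u1"), ("t3", "u2"),  # tweet posted by user
])
# Mentions related to m1 through tweets by the same user (M-T-U-T-M).
related = follow_meta_path(graph, node_types, "m1", ["M", "T", "U", "T", "M"]) - {"m1"}
```

Here m2 is reached from m1 because both mentions occur in tweets posted by u1, while m3 is not.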
Relational Graph Construction
[Figure: relational graph whose nodes are <mention, concept> pairs such as <gators, Florida Gators men's basketball>, <hawks, Atlanta Hawks>, <hawks, Hawk>, <bucks, Milwaukee Bucks>, <tonight, Tonight>, <days, Day> and <now, Now>, annotated with scores]
Local Compatibility
o Mention Features (e.g., idf, keyphraseness)
o Concept Features (e.g., # of incoming/outgoing links)
o Mention + Concept Features (e.g., prior popularity, tf)
o Context Features (e.g., capitalization, tf-idf)
Relational Graph Construction (con’t)
[Figure: the same relational graph with coreference edges (weight 1.0) added between nodes that share a mention and a concept]
Coreference
o At least one meta path exists between two similar mentions
Relational Graph Construction (con’t)
[Figure: the same relational graph with semantic relatedness edges added]
Semantic Relatedness (SR)
o SR between two mentions: meta path
o SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
Linear combination of these three graphs
A Deep Semantic Relatedness Model (DSRM)
Semantic Knowledge Graphs
[Figure: knowledge graph around Miami Heat; labels shown include Description, Titanic, Coach (Erik Spoelstra), Location (Miami), Founded (1988), Roster (Dwyane Wade), Type (Professional Sports Team) and Member (National Basketball Association)]
The DSRM Architecture
[Figure: two parallel networks, one for concept ci and one for concept cj, listed here from input to output]
o Feature vector x: Di (1m), Ci (4m), Ri (3.2k), CTi (1.6k), and likewise Dj, Cj, Rj, CTj
o Word hashing layer: 105k (50k + 50k + 3.2k + 1.6k)
o Multi-layer nonlinear projections: 300, 300, 300
o Semantic layer y
o Semantic relatedness SR(ci, cj): cosine similarity
o Example input: Miami Heat with Location (Miami), Roster (Dwyane Wade), Type (Professional Sports Team) and Member (National Basketball Association)
Data and Scoring Metric
Data
o A Wikipedia dump from May 3, 2013
o A portion of Freebase limited to the Wikipedia concepts
o Wikification: a public data set of 502 messages from 28 users (Meij et al., 2012)
o Semantic relatedness: a benchmark test set with 3,314 concepts as testing queries (Ceccarelli et al., 2013)
Scoring Metric
o Wikification: standard precision, recall and F1
o Semantic relatedness: nDCG
Models for Comparison
o TagMe: an unsupervised model based on prior popularity and semantic relatedness of a single message (Ferragina and Scaiella, 2010)
o Meij: the state-of-the-art supervised approach based on the random forest model (Meij et al., 2012)
o SSRegu: our proposed semi-supervised graph regularization model with all three types of relations
Overall Performance (ACL’14)
o Meij: uses 100% of the labeled data
o SSRegu: uses 50% of the labeled data
o 7.5% absolute F1 gain over the state-of-the-art supervised models
[Figure: bar chart comparing TagMe, Meij, SSRegu + M&W and SSRegu + DSRM; plotted values: 65.0%, 59.0%, 59.8%, 52.5%, 47.5%, 51.6%, 44.1%, 42.3%, 37.0%, 55.0%, 39.3%, 32.9%]
Quality of Semantic Relatedness (ACL’15 Submission)
[Figure: DSRM compared with the standard relatedness method M&W (Milne and Witten, 2008)]
Semantic Relatedness: Examples
Concept             M&W   DSRM
New York City       0.92  0.22
New York Knicks     0.78  0.79
Washington, D.C.    0.80  0.30
Washington Wizards  0.60  0.85
Atlanta             0.71  0.39
Atlanta Hawks       0.53  0.83
Houston             0.55  0.37
Houston Rockets     0.49  0.80
Semantic relatedness scores between a sample of concepts and the concept “National Basketball Association” in the sports domain.
Impact of Semantic Relatedness on Concept Disambiguation
o News dataset: 4,485 mentions (Hoffart et al., 2011)
o AIDA: an unsupervised collective inference method (Hoffart et al., 2011)
o Our methods are completely unsupervised
[Figure, Tweet Set panel: TagMe, Meij, SSRegu + M&W, SSRegu + DSRM]
[Figure, News Dataset panel: AIDA, SSRegu + M&W, SSRegu + DSRM]
Remaining Challenges
Mention detection is the performance bottleneck
Mention disambiguation: city and country names that refer to sports teams (e.g., “Miami” -> “Miami Heat”)
o Incorporate user interests
Non-linkable entity mention recognition and clustering
[Figure: Error Distribution]
Contribution IV: A HIN-based Resolution Model
Target Candidate Identification
Considering all entities would be overwhelming
o Makes resolution difficult and hurts system efficiency
Temporal Distribution Assumption
o Intuition: social users should know the real targets before they use morphs
o Assume the target candidates appear within a certain time period (e.g., 7 days) of the morph
o Naïve, but it narrows the candidates down to 1% while keeping 92% of all targets
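The temporal distribution assumption amounts to a simple date filter. The sketch below is illustrative only: the function name, the input layout (lists of dates per morph and per entity) and the symmetric window are assumptions, and the dates are made up.

```python
from datetime import date, timedelta

def temporal_candidates(morph_dates, entity_dates, window_days=7):
    """Keep entities mentioned within `window_days` of any occurrence of the morph.

    morph_dates: dates on which the morph appears.
    entity_dates: dict mapping entity name -> dates on which it appears.
    """
    window = timedelta(days=window_days)
    return {
        entity
        for entity, dates in entity_dates.items()
        if any(abs(d - m) <= window for d in dates for m in morph_dates)
    }

morph_dates = [date(2012, 5, 10)]
entity_dates = {
    "Bo Xilai": [date(2012, 5, 8)],   # 2 days before the morph: kept
    "Obama": [date(2012, 6, 20)],     # 41 days after the morph: dropped
}
candidates = temporal_candidates(morph_dates, entity_dates)
```

Only entities that co-occur in time with the morph survive, which is what shrinks the candidate pool before ranking.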
Target Candidate Ranking: Motivating Example
Weibo (censored)
o Conquer West King from Chongqing fell from power, still need to sing red songs?
o There is no difference between that guy’s plagiarism and Not Thick’s gang crackdown.
o Remember that Not Thick said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship.
Twitter and Chinese News (uncensored)
o Bo Xilai: ten thousand letters of accusation have been received during Chongqing gang crackdown.
o The webpage of “Tianze Economic Study Institute” owned by the liberal party has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power.
o Bo Xilai gave an explanation about the source of his son, Bo Guagua’s tuition.
o Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs.
Heterogeneous Information Network
[Figure: example of a morph-related heterogeneous information network]
Network Schema
o M: Morphs
o E: Entities
o EV: Events
o NP: Non-Entity Noun Phrases
Three types of meta-paths:
o M – E – E
o M – EV – E
o M – NP – E
Each meta-path provides a unique angle to measure how similar two objects are
Meta Path-based Similarity Measures
o Common Neighbors: the number of common neighbors between a morph m and a target e
o Path Count: the number of paths between m and e
o Pairwise Random Walk
o Kullback-Leibler Distance
[Figure: paths p1 and p2 linking m and e through an intermediate object x]
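Three of the measures listed above can be sketched for a length-two meta path such as M – NP – E. The sketch assumes each object is described by a dictionary of co-occurrence counts with its neighbors, and it uses the usual pairwise random walk definition (the product of the two walk probabilities into a shared midpoint, summed over midpoints). The function names and toy counts are hypothetical.

```python
def common_neighbors(counts_m, counts_e):
    """Number of neighbors shared by the morph and the target."""
    return len(set(counts_m) & set(counts_e))

def path_count(counts_m, counts_e):
    """Number of path instances m - x - e, counting repeated links."""
    return sum(c * counts_e[x] for x, c in counts_m.items() if x in counts_e)

def pairwise_random_walk(counts_m, counts_e):
    """Probability that random walks from m and from e meet at the same midpoint x."""
    total_m, total_e = sum(counts_m.values()), sum(counts_e.values())
    if not total_m or not total_e:
        return 0.0
    return sum((c / total_m) * (counts_e[x] / total_e)
               for x, c in counts_m.items() if x in counts_e)

# Toy co-occurrence counts with noun phrases (illustrative values only).
morph = {"Chongqing": 2, "red songs": 1, "fell from power": 1}
target = {"Chongqing": 3, "fell from power": 1, "tuition": 1}
cn = common_neighbors(morph, target)
pc = path_count(morph, target)
prw = pairwise_random_walk(morph, target)
```

Common neighbors ignores link multiplicity, path count rewards frequent co-occurrence, and the pairwise random walk normalizes by how many links each object has overall.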
Integrate Cross Source/Cross Genre Information
Comparisons of Weibo and Twitter
o Weibo: Already put in prison, do we still need to serve Not Thick?
o Twitter: ...call Bo Xilai “conquer west king” or “Not Thick”...
o Information from media not under censorship is more explicit
o Integrate information from Twitter to help morph resolution
Integrate information from cross-genre web documents
o Richer and cleaner information
o Existing NLP tools work better
Learning-to-Rank
Logistic Regression model to combine different sets of features
Morph              Target      LCS  CN   PRW    Social  …  Label
Conquer West King  Bo Xilai    0    100  0.4    0.6     …  1
Conquer West King  Wang Lijun  1    50   0.3    0.6     …  0
Conquer West King  Obama       0    4    0.001  0.0     …  0
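The ranking step can be sketched with a tiny logistic regression trained by gradient descent on the three rows of the table. This is an illustration, not the thesis system: only the four visible feature columns are used, CN is divided by 100 to keep the features on a similar scale, and the learning rate, epoch count and function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean logistic loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for k, v in enumerate(x):
                grad_w[k] += error * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def score(weights, bias, x):
    """Estimated probability that the candidate is the morph's true target."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Features: [LCS, CN / 100, PRW, Social], taken from the table above.
candidates = ["Bo Xilai", "Wang Lijun", "Obama"]
rows = [[0, 1.00, 0.4, 0.6], [1, 0.50, 0.3, 0.6], [0, 0.04, 0.001, 0.0]]
labels = [1, 0, 0]
weights, bias = train_logistic_regression(rows, labels)
ranking = sorted(candidates,
                 key=lambda c: score(weights, bias, rows[candidates.index(c)]),
                 reverse=True)
```

Candidates for a morph are then sorted by the predicted probability, so the ranked list can be evaluated at different cut-offs k.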
Data and Scoring Metric
Data
o Time frame: 05/01/2012-06/30/2012
o 1,555K Chinese messages from Weibo
o 66K formal web documents from embedded URLs
o 25K Chinese messages from English Twitter for sensitive morphs
o Tested on 107 morph entities in Weibo, 23 of which are sensitive
Scoring Metric
Acc@k = Ck / T
o Ck: the number of correctly resolved morphs at top position k
o T: the total number of morphs in ground truth
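The metric above translates directly into code. In this sketch the input layout (a dictionary of ranked candidate lists and a dictionary of gold targets) and the function name are assumptions, and the data is a toy example.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Acc@k = Ck / T.

    ranked_candidates: dict mapping morph -> list of candidates, best first.
    gold: dict mapping morph -> true target (T is the size of this dict).
    """
    if not gold:
        return 0.0
    correct = sum(1 for morph, target in gold.items()
                  if target in ranked_candidates.get(morph, [])[:k])
    return correct / len(gold)

ranked = {
    "Conquer West King": ["Bo Xilai", "Wang Lijun", "Obama"],
    "Baby": ["Obama", "Wen Jiabao"],
}
gold = {"Conquer West King": "Bo Xilai", "Baby": "Wen Jiabao"}
acc1 = accuracy_at_k(ranked, gold, 1)
acc5 = accuracy_at_k(ranked, gold, 5)
```

With this toy input, one of two morphs is resolved at the top position, and both are resolved within the top five.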
Overall Performance (ACL’13)
[Figure: Acc@k at k = 1, 5, 10, 20 for the homogeneous network-based method (Hsiung et al., 2005) and our HIN-based approach; plotted values: 70.1%, 65.9%, 59.4%, 51.9%, 47.7%, 41.6%, 37.9%, 23.4%]
Remaining Challenges
Morph and non-morph ambiguity
o Unique: mainly used as morphs (e.g., Governor Bo)
o Common: used as both morphs and non-morphs (e.g., Baby and President)
Need deeper profile understanding
o E.g., capture family relations
o E.g., ensure type consistency
Morph popularity is not correlated with resolution performance
[Figure: Morph Resolution Performance for unique vs. common morphs]
Conclusions
We designed various HIN-based methods to enhance natural language understanding in microblogging
o Alleviate the information noisiness, brevity, informality and implicitness problems
We showed that modeling HINs is also powerful in various NLP tasks on microblogs
o Combined existing social relations and deep content analysis methods to construct richer and cleaner HINs
o Designed and explored various novel methods to model HINs
o Significantly outperformed various existing NLP methods
Future Directions
o Explore and model HINs in other genres of data (e.g., News)
o Knowledge transfer from semantic knowledge graphs with deep learning for information extraction
[Figure: Knowledge Graphs, Knowledge Representations, Texts]
Related Publications
o H. Huang, L. Heck and H. Ji. Leveraging Deep Neural Networks and Knowledge Graphs for Entity Disambiguation. ACL2015 submission (full).
o B. Zhang, H. Huang, X. Pan, H. Ji, K. Knight, Z. Wen, Y. Sun, J. Han and B. Yener. 2014. Be Appropriate and Funny: Automatic Entity Morph Encoding. ACL2014 (short). [3 Citations]
o H. Huang, Y. Cao, X. Huang, H. Ji and C. Lin. 2014. Collective Tweet Wikification based on Semi-supervised Graph Regularization. ACL2014 (full). [6 Citations]
o H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han and H. Li. 2013. Resolving Entity Morphs in Censored Data. ACL2013 (full). [12 Citations]
o H. Huang, A. Zubiaga, H. Ji, H. Deng, D. Wang, H. Le, T. Abdelzaher, J. Han, A. Leung, J. Hancock and C. Voss. 2012. Tweet Ranking based on Heterogeneous Networks. COLING2012 (full). [13 Citations]
Impact of This Thesis
The idea of modeling HINs for NLP has been exploited by some recent work in the NLP community
o Yu et al. (2014) exploited a framework similar to our microblog ranking model and achieved state-of-the-art slot filling validation performance
o Zhang et al. (2014) modeled HINs with content information to enhance information recommendation
The morph work has inspired several studies on this particular language phenomenon
o Chen et al. (2013) examined the impact of active censorship on language usage in microblogging
o Hiruncharoenvate et al. (2015) designed algorithms to bypass censorship
Thank You!
Questions?