Multi-view Exploratory Learning for AKBC Problems
Bhavana Dalvi and William W. Cohen
School of Computer Science, Carnegie Mellon University
1. Motivation
• The traditional EM method for SSL jointly learns the missing labels of unlabeled data points as well as the model parameters.
• We consider two extensions of traditional EM for SSL:
  - We introduce a new latent variable, unobserved classes, by dynamically creating new classes when appropriate.
  - We assign multiple labels from multiple levels of a class hierarchy, while satisfying ontological constraints and considering multiple data views.

2. Multi-view Exploratory EM
Inputs: X_l: labeled data points, represented as X_l^(1) … X_l^(v) in case of v views; Y_l: labels of X_l; X_u: unlabeled data points; N: #data points; k: #seed classes; Z_k: constraints on the k seed classes.
Outputs: {θ_1 … θ_(k+m)}: parameters for the k seed and m newly added classes, where class j has one parameter set per data view, θ_j^(1) … θ_j^(v); Z_(k+m): set of class constraints between the k+m classes; Y_u: labels for X_u.

• Initialize the model θ_j^0 with a few seeds per class C_j.
• Iterate until convergence (of data likelihood and number of classes):
  - E step (iteration t): predict labels for the unlabeled data points.
      For i = 1 : N
        P(C_j | X_i) = CombineMultiViewScore(X_i^(1)…(v), θ_j^(1)…(v))
        If NewClassCreationCriterion(P(C_j | X_i), Z^t):
          create a new class C_new and assign X_i to it;
          Z^t = UpdateConstraints(Z^t, C_new, {Y_l, Y_u})
        Y_i^t = OptimalLabelAssignment(P(C_j | X_i), Z^t)
  - M step: re-compute the model parameters θ_j^(t+1) using the seeds and the predicted labels Y_i^t for the unlabeled data points. Note that the number of classes might increase in each iteration.
  - Check whether the model selection criterion is satisfied; if not, revert to the model from iteration t-1.

[Figure: an example use-case of Exploratory EM, showing an ontology with classes Location (State, Country) and Food (Vegetable, Condiment), a dynamically created class C8, and an example instance "Coke" with its class-membership probabilities.]
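Below is a minimal, single-view sketch of this loop, assuming a centroid-based model whose cosine scores are turned into a softmax posterior, and using the MinMax new-class criterion from panel 3 below. CombineMultiViewScore, UpdateConstraints, OptimalLabelAssignment, and the model-selection check are simplified away or omitted; all names here are illustrative, not the released implementation.

import numpy as np

def explore_em(X_seed, y_seed, X_unlab, k, n_iter=20, min_max_ratio=2.0):
    # X_seed, X_unlab: rows are L2-normalized feature vectors; y_seed in 0..k-1.
    centroids = [X_seed[y_seed == j].mean(axis=0) for j in range(k)]
    labels = np.zeros(len(X_unlab), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X_unlab):
            # E step: posterior over current classes (softmax of similarities).
            scores = np.vstack(centroids) @ x
            post = np.exp(scores - scores.max())
            post = np.maximum(post / post.sum(), 1e-12)
            # MinMax criterion: a near-uniform posterior means x fits no
            # existing class well, so create a new class seeded by x.
            if post.max() / post.min() < min_max_ratio:
                centroids.append(x.copy())
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(post.argmax())
        # M step: re-estimate every centroid from its seeds (for the k seed
        # classes) plus the unlabeled points currently assigned to it.
        for j in range(len(centroids)):
            parts = [X_unlab[labels == j]]
            if j < k:
                parts.append(X_seed[y_seed == j])
            pts = np.vstack(parts)
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return labels, centroids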
3. Modeling Unobserved Classes
Dynamically Introducing New Classes
• Hypothesis: dynamically inducing clusters of data points that do not belong to any of the seeded classes reduces semantic drift.
• For each data point X_i, we compute the posterior distribution P(C_j | X_i) of X_i belonging to each of the existing classes C_1 … C_k [Dalvi et al., ECML 2013].
• Criterion 1: MinMax
    maxP = max_j P(C_j | X_i);  minP = min_j P(C_j | X_i)
    If maxP / minP < 2, create a new class/cluster.
• Criterion 2: JS (Jensen-Shannon divergence)
    uniP = the uniform distribution over the k classes
    jsDiv = JS-Divergence(uniP, P(C_j | X_i))
    If jsDiv < 1/k, create a new class/cluster.
A sketch of both criteria follows.
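A small sketch of the two criteria, with the thresholds (ratio 2 and 1/k) taken from this panel; everything else is illustrative.

import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def needs_new_class_minmax(posterior, ratio=2.0):
    # Near-uniform posterior: no existing class is a clear winner.
    return posterior.max() / posterior.min() < ratio

def needs_new_class_js(posterior):
    # Posterior close to uniform: JS divergence below 1/k.
    k = len(posterior)
    uni = np.full(k, 1.0 / k)
    return js_divergence(uni, posterior) < 1.0 / k

posterior = np.array([0.26, 0.25, 0.25, 0.24])   # nearly uniform over k=4
print(needs_new_class_minmax(posterior), needs_new_class_js(posterior))   # True True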
• For hierarchical classification we also need to decide where to place a newly created class:
  - a divide-and-conquer (DAC) method for extending a tree-structured ontology [Dalvi et al., AKBC 2013];
  - an extension of DAC that extends a generic ontology with subset and mutual-exclusion constraints (OptDAC) [Dalvi and Cohen, under review].

4. Incorporating Multiple Views and Ontological Constraints

Multiple Data Views
• Each data point and each class centroid (or classifier) has a representation in multiple views: x_i^(1) … x_i^(v) and C_j^(1) … C_j^(v).
• E.g., in the noun-phrase classification task, we consider co-occurrences of NPs in text sentences (View 1) and in HTML tables (View 2).
• Combining scores from multiple views (see the sketch after this list):
  - Sum-Score: addition of per-view scores
  - Prod-Score: product of per-view scores
  - Max-Agree: maximize agreement between per-view label assignments [Dalvi and Cohen, in submission]
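To make the three combination rules concrete, here is an illustrative sketch for a single data point, assuming per-view class-score vectors over the same label set. The Max-Agree shown is a brute-force stand-in for the poster's optimization formulation, and agree_bonus is an invented parameter.

import numpy as np
from itertools import product

def sum_score(view_scores):
    return np.sum(view_scores, axis=0).argmax()    # add scores across views

def prod_score(view_scores):
    return np.prod(view_scores, axis=0).argmax()   # multiply scores across views

def max_agree(view_scores, agree_bonus=0.5):
    # Pick per-view labels maximizing total score plus an agreement bonus.
    best, best_val = None, -np.inf
    for labels in product(range(view_scores.shape[1]), repeat=len(view_scores)):
        val = sum(view_scores[v][l] for v, l in enumerate(labels))
        val += agree_bonus * (len(set(labels)) == 1)
        if val > best_val:
            best, best_val = labels, val
    return best

scores = np.array([[0.7, 0.2, 0.1],    # view 1 (e.g., text sentences)
                   [0.3, 0.6, 0.1]])   # view 2 (e.g., HTML tables)
print(sum_score(scores), prod_score(scores), max_agree(scores))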
Ontological Constraints
• Each data point is assigned a bit vector of labels; subset and mutual-exclusion constraints decide the consistency of potential bit vectors.
• GLOFIN: a mixed-integer program is solved for each data point to get the optimal label vector [Dalvi et al., WSDM 2015]. A toy version is sketched below.
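A toy sketch of the consistent-label-vector search that GLOFIN formulates as a mixed-integer program; here we simply brute-force all bit vectors over a small hypothetical ontology (the class names and constraints are invented for illustration).

from itertools import product

CLASSES = ["Food", "Vegetable", "Condiment", "Location"]   # hypothetical ontology
SUBSET = [(1, 0), (2, 0)]    # Vegetable is a subset of Food; so is Condiment
MUTEX = [(1, 2), (0, 3)]     # Vegetable/Condiment and Food/Location are mutually exclusive

def consistent(bits):
    ok_subset = all(bits[child] <= bits[parent] for child, parent in SUBSET)
    ok_mutex = all(bits[a] + bits[b] <= 1 for a, b in MUTEX)
    return ok_subset and ok_mutex

def best_label_vector(scores):
    # Among consistent bit vectors, maximize the summed score of chosen labels.
    candidates = (bits for bits in product([0, 1], repeat=len(scores))
                  if consistent(bits))
    return max(candidates, key=lambda b: sum(s for s, x in zip(scores, b) if x))

print(best_label_vector([0.9, 0.8, 0.3, 0.2]))   # -> (1, 1, 0, 0): Food + Vegetable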
5. Model Selection
• This step makes sure that we do not create too many new classes.
• We tried the BIC, AIC, and AICc criteria; corrected AIC (AICc) worked best for our tasks:
    AIC(g) = 2v - 2 L(g)
    AICc(g) = AIC(g) + 2v(v+1) / (n - v - 1)
  where g is the model being evaluated, L(g) is the log-likelihood of the data given g, v is the number of free parameters of the model, and n is the number of data points.
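A small numeric sketch of this check, using AIC(g) = 2v - 2 L(g) as above; the likelihood values and parameter counts below are made up. A model with newly created classes is kept only if its AICc improves (lower is better).

def aicc(log_likelihood, v, n):
    # Corrected AIC: AIC plus a small-sample penalty term.
    aic = 2 * v - 2 * log_likelihood
    return aic + 2 * v * (v + 1) / (n - v - 1)

n = 1000
print(aicc(log_likelihood=-4200.0, v=50, n=n))   # current model with k classes
print(aicc(log_likelihood=-4100.0, v=60, n=n))   # extended model with new classes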
6. AKBC tasks

Micro-reading
• Task: classify an entity mention using context-specific features.
• Clustering NIL entities for the KBP entity discovery and linking (EDL) task [Mazaitis et al., KBP 2014].
Multi-view Hierarchical SSL (MaxAgree)
• The MaxAgree method exploits clues from different data views.
• We define multi-view clustering as an optimization problem and compare various methods for combining scores across views.
• The MaxAgree method is more robust than the Prod-Score method when we vary the difference in performance between the views:

  Correlation of performance improvement over the best view,
  w.r.t. the difference in performance between views:
    Method        Coefficient   P-value
    Prod-Score    -0.59         0.01
    MaxAgree      -0.05         0.82

• Our proposed Hier-MaxAgree method can incorporate both the clues from multiple views and the ontological constraints [Dalvi and Cohen, in submission].
• On entity classification for the NELL KB, our proposed Hier-MaxAgree method gave state-of-the-art performance.

[Figure: performance (roughly 40-70 on the y-axis) vs. training percentage (5-30%), comparing Concatenation, Co-training, Sum-Score, Prod-Score, and Hier-MaxAgree.]
Hierarchical Exploratory Learning (OptDAC)
• We proposed OptDAC, which can do hierarchical SSL in the presence of incomplete class ontologies.
• It employs a mixed-integer programming formulation to find optimal label assignments for a data point, while traversing the class ontology top-down to detect whether a new class needs to be added and where to place it [Dalvi and Cohen, under review] (a toy sketch follows below).
• Our proposed framework combines structural search for the best class hierarchy with SSL, reducing the semantic drift associated with erroneously grouping unanticipated classes with expected classes.
• Optimized Divide and Conquer (OptDAC) combines:
  1) a divide-and-conquer, top-down strategy to detect and place new categories in the ontology, with
  2) a mixed-integer programming technique (GLOFIN) to select the optimal set of labels for a data point, consistent with the ontological constraints.

[Figure: an example ontology (rooted at Root) extended by OptDAC; results are reported for four settings: Text-patterns + Ontology-1, Text-patterns + Ontology-2, HTML-tables + Ontology-1, and HTML-tables + Ontology-2.]
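A toy sketch of the top-down detect-and-place traversal, assuming a tree ontology and a per-class scoring function; the real OptDAC solves a mixed-integer program (GLOFIN) at each step rather than this greedy rule, and all names below are hypothetical.

def place(node, x, children, score, new_class_threshold=0.5):
    # Descend from `node`; return the path of labels assigned to x.
    kids = children.get(node, [])
    if not kids:
        return [node]
    scores = {c: score(c, x) for c in kids}
    best = max(scores, key=scores.get)
    if scores[best] < new_class_threshold:
        # No child fits well: create a new class under the current node.
        new_class = node + "/new"
        children[node].append(new_class)
        return [node, new_class]
    return [node] + place(best, x, children, score, new_class_threshold)

# Hypothetical ontology and scorer.
tree = {"Root": ["Food", "Location"], "Food": ["Vegetable", "Condiment"]}
toy_score = lambda c, x: x.get(c, 0.0)
print(place("Root", {"Food": 0.9, "Vegetable": 0.2}, tree, toy_score))
# -> ['Root', 'Food', 'Food/new']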
Macro-reading (Explore-EM)
• Semi-supervised classification of noun phrases into categories, using distributional features.
• Exploratory learning can reduce the semantic drift of seed classes [Dalvi et al., ECML 2013].
• Different document representations:
  - Naive Bayes: assumes a multinomial distribution over feature occurrences and explicitly models the class prior.
  - Seeded K-Means: similarity based on cosine distance between centroids and data points (sketched below).
  - Seeded von Mises-Fisher: an SSL method for data distributed on the unit hypersphere.

[Figure: macro-averaged F1 score on the 20 Newsgroups dataset (#seed classes = 6) for the different document representations.]
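A minimal sketch of the seeded K-Means variant, assuming L2-normalized feature rows and at least one seed per class; illustrative only.

import numpy as np

def seeded_kmeans(X_seed, y_seed, X_unlab, k, n_iter=10):
    # Centroids start from the labeled seeds of each class.
    centroids = np.vstack([X_seed[y_seed == j].mean(axis=0) for j in range(k)])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    for _ in range(n_iter):
        # Assign each unlabeled point to the most cosine-similar centroid.
        labels = (X_unlab @ centroids.T).argmax(axis=1)
        for j in range(k):
            pts = np.vstack([X_seed[y_seed == j], X_unlab[labels == j]])
            centroids[j] = pts.mean(axis=0)
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return labels, centroids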
Automatic Gloss Finding for KBs (GLOFIN)
• We developed the GLOFIN method, which takes a gloss-free KB and a large collection of glosses, and automatically matches glosses to entities in the KB [Dalvi et al., WSDM 2015].
• Glosses with only one candidate KB entity (unambiguous glosses) are used as training data to train a hierarchical classification model for the categories in the KB. Ambiguous glosses are then disambiguated based on the KB category they are put in (a schematic sketch follows below).
• Our method outperformed SVM and a label-propagation baseline, especially when the amount of training data is small.
• Future work: apply GLOFIN to word sense disambiguation w.r.t. the WordNet synset hierarchy.

[Figure: precision, recall, and F1 for SVM, label propagation, and GLOFIN-Naive-Bayes.]
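A schematic sketch of this train-on-unambiguous / disambiguate-ambiguous split, using a flat TF-IDF + Naive Bayes stand-in for the paper's hierarchical model; the glosses, entities, and categories below are invented, and scikit-learn is assumed to be available.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Unambiguous glosses (one candidate entity each) give labeled training data.
train_glosses = ["a leafy green vegetable", "a city in western Pennsylvania"]
train_categories = ["Food", "Location"]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_glosses), train_categories)

# An ambiguous gloss: keep the candidate entity whose KB category matches
# the category predicted for the gloss.
candidates = {"Turkey (country)": "Location", "Turkey (bird)": "Food"}
gloss = "a country straddling western Asia and southeastern Europe"
predicted = clf.predict(vec.transform([gloss]))[0]
match = [e for e, cat in candidates.items() if cat == predicted]
print(predicted, match)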
Conclusions
• Exploratory learning helps reduce the semantic drift of seeded classes. It becomes more powerful in conjunction with multiple data views and a class hierarchy, imposed as soft constraints on the label vectors.
• It can be applied to multiple AKBC tasks such as macro-reading, gloss finding, and ontology extension.
• Datasets and code can be downloaded from:
www.cs.cmu.edu/~bbd/exploratory_learning
Acknowledgements: This work is supported by a Google PhD Fellowship in Information Extraction and a Google Research Grant.