Multi-view Exploratory Learning for AKBC Problems
Bhavana Dalvi and William W. Cohen
School of Computer Science, Carnegie Mellon University

1 Motivation
- Our proposed framework combines structural search for the best class hierarchy with SSL, reducing the semantic drift associated with erroneously grouping unanticipated classes with expected classes.

2 Multi-view Exploratory EM
- The traditional EM method for SSL jointly learns the missing labels of the unlabeled data points as well as the model parameters.
- We consider two extensions of traditional EM for SSL:
  - We introduce a new latent variable, unobserved classes, by dynamically adding new classes when appropriate.
  - We assign multiple labels from multiple levels of the class hierarchy while satisfying ontological constraints and considering multiple data views.

Inputs: X_L: labeled data points; Y_L: labels of X_L; X_U: unlabeled data points; N: #data points; k: #seed classes; Z_k: constraints on the k seed classes.
Outputs: {θ_1 ... θ_{k+m}}: parameters for the k seed and m newly added classes (each class C_j can have v per-view parameter sets θ_j^(1) ... θ_j^(v)); Z_{k+m}: set of class constraints among the k+m classes; Y_U: labels for X_U.

Algorithm (Exploratory EM):
- Initialize the model θ^(0) with a few seeds per class C_j.
- Iterate till convergence (of data likelihood and number of classes):
  - E step (iteration t): predict labels for the unlabeled data points.
    For i = 1 : N
      P(C_j | X_i) = CombineMultiViewScore(X_i^(1...v), θ_j^(1...v))
      If NewClassCreationCriterion(P(C_j | X_i), Z^t): create a new class C_new, assign X_i to it, and set Z^t = UpdateConstraints(Z^t, C_new, ...)
      Y_U^t = OptimalLabelAssignment(P(C_j | X_i), Z^t)
  - M step: re-compute the model parameters θ_{k+m} using the k seed classes and the predicted labels Y_U^t for the unlabeled data points. The number of classes might increase in each iteration.
  - Check whether the model selection criterion is satisfied; if not, revert to the model from iteration t-1.

[Figure: example use-case of Exploratory EM. A class hierarchy with Location (State, Country) and Food (Vegetable, Condiment); an entity such as "Coke", whose posterior mass fits none of the seed classes well, triggers creation of a new class C8.]
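To make the E/M loop above concrete, here is a minimal single-view Python sketch (not the poster's implementation): class centroids stand in for the per-class parameters θ_j, a simple max/min ratio test stands in for NewClassCreationCriterion (the MinMax and JS criteria are detailed in panel 3 below), and the ontological constraints and model-selection check are omitted. All function and parameter names here are illustrative.

```python
import numpy as np

def new_class_criterion(posterior, ratio=2.0):
    """MinMax-style test: if the largest and smallest class posteriors are
    close to each other, the point fits no existing class well."""
    return posterior.max() / (posterior.min() + 1e-12) < ratio

def exploratory_em(X_labeled, y_labeled, X_unlabeled, n_iters=10):
    """Seeded, K-means-style exploratory EM; centroids are the class models."""
    y_arr = np.array(y_labeled)
    seed_classes = sorted(set(y_labeled))
    centroids = [X_labeled[y_arr == c].mean(axis=0) for c in seed_classes]
    assignments = []
    for t in range(n_iters):
        assignments = []
        for x in X_unlabeled:                        # E step
            sims = np.array([x @ c for c in centroids])
            posterior = np.exp(sims - sims.max())    # softmax over class scores
            posterior /= posterior.sum()
            if new_class_criterion(posterior):       # near-uniform posterior
                centroids.append(x.copy())           # create a new class seeded at x
                assignments.append(len(centroids) - 1)
            else:
                assignments.append(int(posterior.argmax()))
        for j in range(len(centroids)):              # M step: refit each class
            members = [x for x, a in zip(X_unlabeled, assignments) if a == j]
            if j < len(seed_classes):                # seed classes keep their seeds
                members += list(X_labeled[y_arr == seed_classes[j]])
            if members:
                centroids[j] = np.mean(members, axis=0)
    return centroids, assignments
```

In the full method the E step would combine scores from all data views, label assignments would be constrained by the ontology, and the model-selection check (AICc, panel 3) could reject an iteration that created too many new classes.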
3 Modeling Unobserved Classes

Dynamically introducing new classes
- Hypothesis: dynamically inducing clusters for data points that do not belong to any of the seeded classes will reduce semantic drift.
- For each data point X_i, we compute the posterior distribution P(C_j | X_i) of X_i belonging to each of the existing classes C_1 ... C_k [Dalvi et al., ECML 2013].
- Criterion 1 (MinMax): max_i = max_j P(C_j | X_i) and min_i = min_j P(C_j | X_i); if max_i / min_i < 2, create a new class/cluster.
- Criterion 2 (JS, Jensen-Shannon divergence): let unif be the uniform distribution over the k classes and JSDiv = JS-Divergence(unif, P(C | X_i)); if JSDiv < 1/k, create a new class/cluster.
- For hierarchical classification we also need to decide where to place the newly created class:
  - a divide-and-conquer (DAC) method for extending a tree-structured ontology [Dalvi et al., AKBC 2013];
  - an extension of DAC that extends a generic ontology with subset and mutual-exclusion constraints (OptDAC) [Dalvi and Cohen, under review].

Model Selection
- This step makes sure that we do not create too many new classes.
- We tried the BIC, AIC, and AICc criteria; Extended AIC (AICc) worked best for our tasks:
  AICc(g) = AIC(g) + 2v(v+1) / (n - v - 1)
  where g is the model being evaluated, AIC(g) = 2v - 2 L(g) with L(g) the log-likelihood of the data given g, v is the number of free parameters of the model, and n is the number of data points.

4 Incorporating Multiple Views and Ontological Constraints

Multiple Data Views
- Each data point and each class centroid (or classifier) has a representation in every view: x_i^(1) ... x_i^(v) and C_j^(1) ... C_j^(v).
- E.g., in the noun-phrase classification task, we consider co-occurrences of NPs in text sentences (View 1) and in HTML tables (View 2).
- Combining scores from multiple views:
  - Sum-Score: addition of per-view scores;
  - Prod-Score: product of per-view scores;
  - Max-Agree: maximize agreement between per-view label assignments [Dalvi and Cohen, in submission].

Ontological Constraints
- Each data point is assigned a bit vector of labels; subset and mutual-exclusion constraints decide which label bit vectors are consistent.
- GLOFIN: a mixed integer program is solved for each data point to get the optimal label vector [Dalvi et al., WSDM 2015].
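As a rough illustration of what a "consistent" label bit vector means under subset and mutual-exclusion constraints, the Python sketch below enumerates label sets for a tiny, made-up ontology and returns the consistent set with the highest total score. GLOFIN itself solves a mixed integer program rather than doing this brute-force search, and the class names, scores, and function names here are illustrative only.

```python
from itertools import product

# Toy ontology (illustrative only): subset constraints (child -> parent) and
# mutual-exclusion constraints (pairs of labels that cannot both be on).
SUBSET = [("State", "Location"), ("Country", "Location"),
          ("Vegetable", "Food"), ("Condiment", "Food")]
MUTEX = [("Location", "Food"), ("State", "Country"), ("Vegetable", "Condiment")]

def is_consistent(labels):
    """labels: the set of class names switched on in the bit vector."""
    for child, parent in SUBSET:
        if child in labels and parent not in labels:
            return False          # subset: a child label implies its parent
    for a, b in MUTEX:
        if a in labels and b in labels:
            return False          # mutual exclusion: at most one of the pair
    return True

def best_label_vector(scores):
    """Brute-force stand-in for GLOFIN's mixed integer program: return the
    consistent label set with the highest total score (exponential in the
    number of classes, so usable only for tiny ontologies). A real objective
    would also penalize switching on low-confidence labels."""
    classes = list(scores)
    best, best_score = set(), 0.0
    for bits in product([0, 1], repeat=len(classes)):
        labels = {c for c, b in zip(classes, bits) if b}
        if is_consistent(labels):
            total = sum(scores[c] for c in labels)
            if total > best_score:
                best, best_score = labels, total
    return best

# Example: per-class scores for one data point.
print(best_label_vector({"Location": 0.1, "State": 0.05, "Country": 0.05,
                         "Food": 0.9, "Vegetable": 0.55, "Condiment": 0.45}))
# prints the set {'Food', 'Vegetable'} (element order may vary)
```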
6 AKBC tasks

Macro-reading (Explore-EM)
- Semi-supervised classification of noun phrases into categories, using distributional features.
- Exploratory learning can reduce the semantic drift of seed classes [Dalvi et al., ECML 2013].
[Figure: results on the 20 Newsgroups dataset (#seed classes = 6).]

Micro-reading
- Task: classify an entity mention using context-specific features.
- Clustering NIL entities for the KBP entity discovery and linking (EDL) task [Mazaitis et al., KBP 2014].

Multi-view Hierarchical SSL (MaxAgree)
- The MaxAgree method exploits clues from different data views.
- We define multi-view clustering as an optimization problem and compare various methods for combining scores across views.
- MaxAgree is more robust than Prod-Score when the difference in performance between the views varies:

  Correlation of the performance improvement over the best single view w.r.t. the difference in performance between views:
  Method      Coefficient  P-value
  Prod-Score  -0.59        0.01
  MaxAgree    -0.05        0.82

- Our proposed Hier-MaxAgree method incorporates both the clues from multiple views and the ontological constraints [Dalvi and Cohen, in submission].
- On entity classification for the NELL KB, our proposed Hier-MaxAgree method gave state-of-the-art performance.
[Figure: macro-averaged F1 score vs. training percentage (5-30%), comparing Concatenation, Co-training, Sum-Score, Prod-Score, and Hier-MaxAgree.]

Hierarchical Exploratory Learning (OptDAC)
- We proposed OptDAC, which performs hierarchical SSL in the presence of incomplete class ontologies.
- Optimized Divide and Conquer (OptDAC) combines 1) a divide-and-conquer, top-down strategy to detect and place new categories in the ontology with 2) a mixed integer programming technique (GLOFIN) to select the optimal set of labels for a data point, consistent with the ontological constraints.
- It employs a mixed integer programming formulation to find optimal label assignments for a data point, while traversing the class ontology top-down to detect whether a new class needs to be added and where to place it [Dalvi and Cohen, under review].
[Figure: an example of an ontology (rooted at Root) extended by OptDAC; results shown for Text-patterns and HTML-tables views on Ontology-1 and Ontology-2.]

Automatic gloss finding for KBs (GLOFIN)
- We developed the GLOFIN method, which takes a gloss-free KB and a large collection of glosses and automatically matches glosses to entities in the KB [Dalvi et al., WSDM 2015].
- Glosses with only one candidate KB entity (unambiguous glosses) are used as training data to train a hierarchical classification model for the categories in the KB. Ambiguous glosses are then disambiguated based on the KB category they are assigned to.
- Different document representations compared within GLOFIN:
  - Naive Bayes: assumes a multinomial distribution for feature occurrences and explicitly models the class prior.
  - Seeded K-Means: similarity based on cosine distance between centroids and data points.
  - Seeded von Mises-Fisher: an SSL method for data distributed on the unit hypersphere.
- Our method outperformed SVM and a label propagation baseline, especially when the amount of training data is small.
- In future work: apply GLOFIN to word sense disambiguation w.r.t. the WordNet synset hierarchy.
[Figure: precision, recall, and F1 of SVM, Label Propagation, and GLOFIN-Naive-Bayes.]

Conclusions
- Exploratory learning helps reduce the semantic drift of seeded classes.
- It gets more powerful in conjunction with multiple data views and a class hierarchy, when these are imposed as soft constraints on the label vectors.
- It can be applied to multiple AKBC tasks such as macro-reading, gloss finding, and ontology extension.
- Datasets and code can be downloaded from: www.cs.cmu.edu/~bbd/exploratory_learning

Acknowledgements: This work is supported by a Google PhD Fellowship in Information Extraction and a Google Research Grant.