Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies

Bhavana Dalvi (¶, *), Aditya Mishra (†), and William W. Cohen (*)
¶ Allen Institute for Artificial Intelligence
* School of Computer Science, Carnegie Mellon University
† Department of Computer Science & Software Engineering, Seattle University

Motivation

- In an entity classification task, topic or concept hierarchies are often incomplete. This can lead to semantic drift of known classes or topics.
- Our previous work on Exploratory Learning (Dalvi et al. ECML 2013) extends the semi-supervised EM algorithm by dynamically adding new classes when appropriate. In this paper, we present exploratory learning techniques for hierarchical semi-supervised learning tasks.
- KB categories are arranged in an ontology. There are subset and disjointness constraints defined between these classes. Further, the class hierarchy can be incomplete.
- We focus on the entity classification task, where each entity is represented by either text context or table co-occurrence features. Given a few seed examples per Knowledge Base (KB) category, the task is to classify unlabeled entities into KB categories.
- Our proposed method (OptDAC) can learn new examples of existing classes, as well as extend the class hierarchy, in a single unified framework. OptDAC reduces semantic drift of seeded classes.

Method: OptDAC Exploratory EM

Inputs:
  X_l: labeled glosses
  Y_l: category labels of X_l
  X_u: unlabeled glosses
  N: |X|
  K: number of classes
  Z_k: class constraints (subclass or disjointness constraints)

Outputs:
  Y_u: labels for X_u
  θ_1 … θ_{k+m}: parameters for k seed and m newly added classes
  Z_{k+m}: set of constraints between k+m classes

Algorithm:
  Initialize the model parameters θ^0 with a few seeds per class C_j.
  Iterate till convergence (till data likelihood AND #classes converge):
    E step (iteration t): assign a bit vector of categories to each gloss.
      For i = 1 : N
        Find P(C_j | X_i; θ^{t-1}) for all classes C_j
        Y_i^t = Optimal-Label-Assignment(P(C_j | X_i; θ^{t-1}), Z)
        {If a new class is created, then class constraints are updated accordingly.}
        Z = UpdateConstraints(X_l, Y_l, X_u, Y_u, Z)
    M step: re-compute model parameters.
      Re-compute θ^t based on current label assignments Y^t.
    Do model selection.

Optimal Label Assignment given Class Constraints

Input: P(C_j | X_i); class constraints: Subset, Mutex (disjoint)
Output: consistent bit vector y_ji for X_i
Objective: maximize {likelihood of assignment - constraint violation penalty}
  Objective terms: score of label assignment, subset constraint penalty, mutex constraint penalty
  Constraints: subset constraint, mutex constraint

When Are New Classes Created?

- Is the posterior over classes near uniform?
- C_new test: the best assignment found by the mixed integer program should pick C_new.

[Figure: class hierarchy diagram with nodes numbered 1 to 11]

Datasets

This dataset is made publicly available at http://rtw.ml.cmu.edu/wk/WebSets/hierarchical_ExploratoryLearning_WSDM2016/index.html

Ontology statistics

  Statistic                   Small Ontology   Medium Ontology
  #Classes                    11               39
  #Levels in the hierarchy    3                4
  #Classes per level          1, 3, 7          1, 4, 24, 10

Dataset statistics

  Dataset        #Entities   #Features   #(Entity, label) pairs
  Text-Small     2.5K        3.4M        7.2K
  Text-Medium    12.9K       6.7M        42.2K
  Table-Small    4.3K        0.96M       12.2K
  Table-Medium   33.4K       2.2M        126.K

- An example text pattern feature for the entity “Pittsburgh” is (“lives in ARG”, 1000), indicating that the entity appeared in position ARG of the text context “lives in ARG” 1000 times in sentences from the ClueWeb09 dataset.
- An example table context feature for the entity “Pittsburgh” is (“clueweb09-en0011-94-04::2:1”, 1), indicating that the entity appeared once in HTML table 2, column 1 of the ClueWeb09 document with id “clueweb09-en0011-94-04”.

Experimental Results

[Figure: Comparison of macro-averaged seeded-class F1 by hierarchy level (Level = 2, 3, 4); series: FLAT-ExploreEM, OptDAC-ExploreEM. Marked points denote statistically significant improvements (0.05 significance level) w.r.t. FLAT-ExploreEM.]

[Figure: OptDAC with varying amount of training data; datasets: Text-Small, Table-Small]

Runtime of Flat vs. OptDAC method on different datasets

                 Avg. runtime in sec.       Avg. runtime in multiples of FLAT Semi-supervised EM
  Dataset        FLAT Semi-supervised EM    FLAT Exploratory EM   OptDAC Semi-supervised EM   OptDAC Exploratory EM
  Text-Small     53.5                       8                     7                           17
  Table-Small    50.7                       3                     10                          21
  Text-Medium    524.7                      5                     11                          25
  Table-Medium   5932.4                     4                     7                           10

[Figure: Evaluation of extended class hierarchies; panels: Small Ontology, Medium Ontology]

Conclusions

- In this paper, we propose the Hierarchical Exploratory EM approach, which takes an incomplete class ontology as input, along with a few seed examples of each class, to populate new instances of seeded classes and extend the ontology with newly discovered classes.
- Our proposed hierarchical exploratory EM method, named OptDAC-ExploreEM, performs better than flat classification and hierarchical semi-supervised EM methods at all levels of the hierarchy, especially further down the hierarchy.
- Experiments show that OptDAC-ExploreEM outperforms its semi-supervised variant by 13% on average in terms of seed class F1 scores. It also outperforms both previously proposed exploratory learning approaches, FLAT-ExploreEM and DAC-ExploreEM, in terms of seed class F1 by 10% and 7% on average, respectively.
- In the future, we would like to apply our method to datasets with non-tree-structured class hierarchies.

Acknowledgements: This work is supported in part by a Google PhD fellowship in Information Extraction, and NSF grant No. IIS1250956-NSFCOHEN.
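The Optimal Label Assignment step described in the Method section (maximize likelihood of assignment minus constraint violation penalty, subject to subset and mutex constraints) can be illustrated with a small sketch. This is not the poster's mixed integer program: it enumerates every bit vector by brute force, which is only feasible for a tiny hierarchy, and it assumes a simple score (the sum of posteriors of the selected classes) and a single fixed penalty per violated constraint. The class names, the probabilities, the `penalty` value, and the function names are all hypothetical.

```python
from itertools import product


def count_violations(y, subset, mutex):
    """Count violated constraints for a bit vector y (dict: class -> 0/1)."""
    violations = 0
    # Subset constraint (child, parent): a selected child requires its parent.
    for child, parent in subset:
        if y[child] and not y[parent]:
            violations += 1
    # Mutex constraint (a, b): the two classes may not both be selected.
    for a, b in mutex:
        if y[a] and y[b]:
            violations += 1
    return violations


def optimal_label_assignment(probs, subset, mutex, penalty=10.0):
    """Brute-force search for the best-scoring bit vector.

    Assumed objective (hypothetical, for illustration only):
        sum of P(C_j | X_i) over selected classes
        - penalty * number of violated constraints
    """
    classes = sorted(probs)
    best_score, best_y = float("-inf"), None
    for bits in product((0, 1), repeat=len(classes)):
        y = dict(zip(classes, bits))
        score = sum(probs[c] for c in classes if y[c])
        score -= penalty * count_violations(y, subset, mutex)
        if score > best_score:
            best_score, best_y = score, y
    return best_y


# Hypothetical posteriors P(C_j | X_i) for one entity over a toy hierarchy.
probs = {"Root": 1.0, "Location": 0.7, "Food": 0.2, "City": 0.6, "Country": 0.3}
subset = [("Location", "Root"), ("Food", "Root"),
          ("City", "Location"), ("Country", "Location")]
mutex = [("Location", "Food"), ("City", "Country")]

result = optimal_label_assignment(probs, subset, mutex)
print(result)
```

With these toy numbers the search selects the path Root, Location, City: adding Food or Country would violate a mutex constraint, and the penalty outweighs the extra posterior mass. A real implementation would replace the enumeration with an integer programming solver, since the number of bit vectors grows exponentially with the number of classes.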