Machine Learning for SEER

Beatrice Alex, Ben Hachey, Yuval Krymolowski

Bootstrapping Techniques for Text Mining

Outline

- SEER Goals
- Overview of the entity data
- Proposed methodology
- Bootstrapping
- Active learning
- Unsupervised approaches
- Conclusion

SEER

Stanford Edinburgh Entity Recognition Project

Aims:
- Overall aim: To provide high-quality, general-purpose entity recognition components.
- To generalise the class of entities that can be recognised within a wide variety of domains.
- To make innovative use of machine-learning techniques to enable the use of unannotated corpora as training material.

SEER is concerned with entities in general, not only named entities.

A Variety of Entities and Properties

Different entity types have different properties, which we can exploit for recognition and initial annotation.

Entity types vary in:
- the required set of features,
- the initial annotation quality,
- the difficulty of the learning task,
hence: the strategy that would yield the best results.

RCAHMS Data

“remains of building were exposed and fragments of hard baked pottery”

“The Innocent Railway was one of Scotland’s first freight railways and was opened in 1831 to carry coal from the mines in Dalkeith into the city.”

Entity      Cap.  Gaz.  Features  Comp.  Chunk
Location:   +     +     local     –      –
Site:       +     –     local     +      –
Term:       –     +     global    +      +
Artefact:   –     –     local     +      +

Site: Not always compositional (Arthur’s Seat).
Term: Divides into subtypes.

Is the term interesting?

- A term appearing in a text may not always be interesting.
- We regard a term as interesting when it is related to a site.
- The terms related to a site are also likely to be related to each other.
“A township comprising one unroofed building, twenty-five roofed buildings, one of which is annotated Mill, twenty-two unroofed structures, some enclosures, several field walls and a head-dyke is depicted”

In this passage, the term “building” is interesting because it is frequent and appears with related terms.

Biomedical Text Data

“Edman analysis of the purified bovine brain GGPPSase suggested that the NH2-terminal”

“an important molecule responsible for the C20-prenylated protein biosynthesis”

Entity      Cap.   Gaz.  Features  Comp.  Chunk
Process:    mixed  +     local     +      +(?)
Organism:   -/+    +     local     –      –
Molecule:   mixed  ?     local     –      –
Organ:      –      +     local     –      –

Process: Experimental and biological processes. Biological processes form a hierarchy (GO).
Organism: Appears as a modifier as well as a proper name.

Further Biomedical Entities

“The amino acid sequence KTQETVQRIL”

“a combination of ZIP and NFAT binding sites is required”

Entity      Cap.  Gaz.  Features  Comp.  Chunk
Sequence:   +     –     local     –      –
Region:     –     –     local     +      +

- many entities contain specialised vocabulary or non-words,
- high compositionality in processes, gene names
- there are more non-named entities than RCAHMS,
- a lot of external structured knowledge.

Proposed Methodology

The main problem is how to handle a new NER task.

- Define a taxonomy of entity classes along the above lines, ideas welcome!
- Match the learning strategy with the statistical properties of entities in the different classes.
- Experiment with active learning, bootstrapping, or unsupervised methods. Different approaches may be complementary.

Given a new domain, entities will be classified and matched with a learning approach based on SEER experience.

Active Learning

- Determine which of a set of unlabelled examples are likely to be the most useful.
Two principal types of active learning:
- Uncertainty-based: Selects examples with the lowest confidence scores.
- Committee-based: Selects examples for which there is disagreement among classifiers.

Active Learning

- Collect a large corpus and provide initial annotation.
- Train a classifier on labelled material.
- Let the classifier label the raw corpus.
- Select instances for manual annotation:
  - using a confidence score provided by the classifier;
  - alternatively, create an aggregate of classifiers and estimate the confidence from their outputs.
- Any confidence estimate may be unreliable, especially at the beginning of the process.
- The sampling approach can be used to provide an additional confidence estimate.
- As the process advances, we can compare the two estimates to see whether we can trust the classifier’s confidence estimate.

Active Learning in SEER

What entities would benefit from active learning?

- Entities that require a high number of features for classification and are therefore harder to learn.
- For such entities, classifiers are likely to disagree or lack confidence more often, so the contribution of a human annotator will be most important.
- With a large number of features, a lot of instances will be classified even when the training data is small, but with poor precision.

Suggestions: RCAHMS site, artefact

Bootstrapping

- Idea: to create a model from little labelled training data and a large amount of unlabelled data.
- Classical bootstrapping: tagging unlabelled data and adding it to the training data to learn a new model.
- This approach introduces too much noise into the model and can even degrade performance (e.g. Clark et al. 2003, Carreras et al. 2003).
- Task: determine which newly labelled examples are most useful and accurate for re-training.
- Co-training proved successful for a range of NLP tasks.

Co-training using Different Features

- Train classifiers using different feature sets (e.g. internal vs. external).
- Collins and Singer (1999):
  - Used spelling and contextual features in parsed data. Either type is sufficient due to feature redundancy.
  - Started off with 7 seed rules and at each iteration induced either additional context or spelling rules to label the training data.
  - Directly maximised an objective function based on the level of agreement between the classifiers.
- Abney (2002) proves that when the feature sets satisfy certain independence conditions, such an approach would indeed yield a classifier with a bounded error.

Co-training using Different Classifiers

- Clark et al. (2003): Use two taggers for co-training, each providing a different (not entirely independent) view of the task.
- Taggers can learn from each other to improve performance.
- Important: must use classifiers that are sufficiently different.
- Best results using agreement-based co-training: select the subset of newly labelled examples that maximises the agreement of the two taggers on unlabelled data.
- Naive co-training does similarly well but is less cost-effective.

Co-training in SEER

What entities would benefit from co-training?

- Entities where we can naturally distinguish two independent feature sets: internal composition and external context. For example: amino acid sequences, protein names.
- Entities for which we can easily supply reliable seed data for co-training, mostly from gazetteers or compositional structure. For example: RCAHMS terms, sites.

Unsupervised Approaches

- Look at similarities among entities in the data.
- Such approaches rely on the distribution of features and benefit from a large feature set.
- Extract a set of instances as entity candidates, possibly using NP chunking.
- Describe each of them using, e.g., contextual features.
- Use a method that not only clusters the instances, but also does feature selection.
- Use the features for bootstrapping or active learning.
- Clusters may include instances that are not captured by the other approaches.

Unsupervised Approaches

Possible methods: Methods that consider clusters as latent variables and optimise a target probabilistic model.

- PLSA (Hofmann 1999): A joint probability model for instances and features; treats instances and features symmetrically.
- Information Bottleneck (Tishby, Pereira, Lee 1993): Clusters the instances, whereas cluster centroids express the relevance of features. Optimises compression vs. abstraction, considering a tradeoff parameter. Maximum-likelihood EM is a special case for a particular value of this parameter.

Summary

- With available learning tasks, we can use active learning, co-training and unsupervised methods for different entity types.
- The gathered experience will be used for new tasks.
- We plan to develop a methodological approach for adapting the most suitable learning method to a given entity or entity set.
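
Appendix: a sketch of the two selection strategies

The two types of active learning named in the Active Learning slides (uncertainty-based and committee-based) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the representation of classifier output as a dictionary of label probabilities, and the use of vote entropy as the disagreement measure. The slides do not specify any of these, so this should not be read as the SEER implementation.

```python
from collections import Counter
from math import log


def least_confident(probabilities, k):
    """Uncertainty-based selection (illustrative).

    probabilities: one dict per unlabelled example, mapping label -> probability.
    Returns the indices of the k examples whose best label has the lowest score.
    """
    confidence = [max(dist.values()) for dist in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


def vote_entropy(votes):
    """Entropy of the labels proposed by a committee for one example.

    Zero when all classifiers agree; larger when their votes are spread out.
    """
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * log(c / total) for c in counts.values())


def most_disputed(committee_votes, k):
    """Committee-based selection (illustrative).

    committee_votes: one list per unlabelled example, holding the label
    proposed by each committee member.
    Returns the indices of the k examples with the highest vote entropy.
    """
    disagreement = [vote_entropy(votes) for votes in committee_votes]
    ranked = sorted(range(len(disagreement)), key=lambda i: -disagreement[i])
    return ranked[:k]
```

Either function returns the indices of the examples to hand to a human annotator. In the loop described on the slides, those examples would be labelled manually, added to the training material, and the classifier or committee retrained before the next round of selection.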