
Machine Learning for SEER
Beatrice Alex, Ben Hachey, Yuval Krymolowski
Outline
SEER Goals
Overview of the entity data
Proposed methodology
Bootstrapping
Active learning
Unsupervised approaches
Conclusion
SEER
Stanford Edinburgh Entity Recognition
Project Aims:
Overall aim: to provide high-quality, general-purpose entity recognition components.
To generalise the class of entities that can be recognised within a wide variety of domains.
To make innovative use of machine-learning techniques to enable the use of unannotated corpora as training material.
SEER is concerned with entities in general,
not only named entities.
A Variety of Entities and Properties
Different entity types have different properties, which we
can exploit for recognition and initial annotation.
Entity types vary in:
the required set of features,
the initial annotation quality,
the difficulty of the learning task,
and hence the strategy that will yield the best results.
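As a concrete illustration, these properties could be recorded per entity class in a small data structure; the field names and example values below are assumptions made for this sketch, not SEER's actual schema.

    from dataclasses import dataclass

    @dataclass
    class EntityClassProfile:
        """Illustrative record of the properties an entity class varies in
        (field names are assumptions for this sketch, not SEER's schema)."""
        name: str
        feature_set: str          # e.g. "local" vs. "global" context features
        annotation_quality: str   # quality of the initial (seed) annotation
        difficulty: str           # how hard the learning task is expected to be
        strategy: str             # learning strategy expected to work best

    location = EntityClassProfile("Location", "local", "high (gazetteer)", "easy",
                                  "bootstrapping")
    print(location)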
RCAHMS Data
“ remains of building were exposed and fragments of hard baked pottery ”

“ The Innocent Railway was one of Scotland’s first freight railways and was opened in 1831 to carry coal from the mines in Dalkeith into the city. ”

Entity      Cap.   Gaz.   Features   Comp.   Chunk
Location:   +      +      local      –       –
Site:       +      –      local      +       –
Term:       –      +      global     +       +
Artefact:   –      –      local      +       +

Site: not always compositional (Arthur’s Seat).
Term: divides into subtypes.
Is the term interesting?
A term appearing in a text may not always be interesting.
We regard a term as interesting when it is related to a site.
The terms related to a site are also likely to be related to each other.

“ A township comprising one unroofed building, twenty-five roofed buildings, one of which is annotated Mill, twenty-two unroofed structures, some enclosures, several field walls and a head-dyke is depicted ”

In this passage, the term “building” is interesting because it is frequent and appears with related terms.
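A minimal sketch of how such an interestingness score might be computed, assuming tokenised passages and a hypothetical list of site-related seed terms (the seed list and scoring function are illustrative assumptions, not SEER components):

    # Hypothetical seed terms assumed to indicate site-related context
    # (this list is an illustration, not SEER data).
    SITE_SEEDS = {"township", "enclosure", "mill", "dyke", "structure"}

    def interestingness(passages, term):
        """Score a term by its frequency and by how often it co-occurs
        with site-related seed terms in the same passage."""
        freq, cooccurring = 0, 0
        for tokens in passages:
            tokens = [t.lower() for t in tokens]
            if term in tokens:
                freq += tokens.count(term)
                if SITE_SEEDS & set(tokens):
                    cooccurring += 1
        return freq * cooccurring  # high when frequent AND accompanied by related terms

    passage = ("A township comprising one unroofed building , twenty-five roofed "
               "buildings , one of which is annotated Mill").split()
    print(interestingness([passage], "building"))  # > 0, so "building" is interesting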
Biomedical Text Data
“ Edman analysis of the purified bovine brain GGPPSase suggested that the NH2-terminal ”

“ an important molecule responsible for the C-prenylated protein biosynthesis ”

Entity      Cap.    Gaz.   Features   Comp.   Chunk
Process:    mixed   +      local      +       +(?)
Organism:   -/+     +      local      –       –
Molecule:   mixed   ?      local      –       –
Organ:      –       +      local      –       –

Process: experimental and biological processes; biological processes form a hierarchy (GO).
Organism: appears as a modifier as well as a proper name.
Further Biomedical Entities
“ The amino acid sequence KTQETVQRIL ”

“ a combination of ZIP and NFAT binding sites is required ”

Entity      Cap.   Gaz.   Features   Comp.   Chunk
Sequence:   +      –      local      –       –
Region:     –      –      local      +       +

many entities contain specialised vocabulary or non-words,
high compositionality in processes and gene names,
there are more non-named entities than in the RCAHMS data,
a lot of external structured knowledge is available.
Proposed Methodology
The main problem is how to handle a new NER task.
Define a taxonomy of entity classes along the above lines (ideas welcome!).
Match the learning strategy with the statistical
properties of entities in the different classes.
Experiment with active learning, bootstrapping, or
unsupervised methods.
Different approaches may be complementary.
Given a new domain, entities will be classified and
matched with a learning approach based on SEER
experience.
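To make the matching step concrete, a simple dispatch function could map a few class properties to a strategy; the property names and thresholds below are purely illustrative assumptions, not SEER decisions:

    # Illustrative sketch: choose a learning strategy from simple entity-class
    # properties (all names and thresholds are assumptions for illustration).

    def choose_strategy(has_two_feature_views, has_reliable_seeds, n_features):
        """Map statistical properties of an entity class to a learning approach."""
        if has_two_feature_views and has_reliable_seeds:
            return "co-training"            # e.g. internal vs. external features
        if n_features > 50:                 # feature-hungry, hard-to-learn classes
            return "active learning"
        return "unsupervised clustering"    # fall back on distributional similarity

    print(choose_strategy(True, True, 20))    # -> co-training
    print(choose_strategy(False, False, 80))  # -> active learning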
Active Learning
Determine which of a set of unlabelled examples are likely
to be the most useful.
Two principal types of active learning:
Uncertainty-based:
Selects examples with lowest confidence scores.
Committee-based:
Selects examples for which there is disagreement
among classifiers.
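Both criteria can be sketched as small scoring functions over an unlabelled pool; the label sets and probabilities below are made-up illustrations of the idea:

    import math

    def uncertainty_score(probs):
        """Uncertainty-based: low confidence = most probable label has low probability."""
        return 1.0 - max(probs)

    def committee_disagreement(label_votes):
        """Committee-based: vote entropy over the labels assigned by several classifiers."""
        n = len(label_votes)
        entropy = 0.0
        for label in set(label_votes):
            p = label_votes.count(label) / n
            entropy -= p * math.log(p)
        return entropy

    # Pick the unlabelled example (by index) the annotator should see next.
    probs_per_example = [[0.9, 0.1], [0.55, 0.45]]
    print(max(range(len(probs_per_example)),
              key=lambda i: uncertainty_score(probs_per_example[i])))   # -> 1

    votes_per_example = [["LOC", "LOC", "LOC"], ["LOC", "SITE", "TERM"]]
    print(max(range(len(votes_per_example)),
              key=lambda i: committee_disagreement(votes_per_example[i])))  # -> 1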
Active Learning
Collect a large corpus and provide an initial annotation.
Train a classifier on the labelled material.
Let the classifier label the raw corpus.
Select instances for manual annotation using a confidence score provided by the classifier.
Alternatively, create an aggregate of classifiers and estimate the confidence from their outputs.
Any confidence estimate may be unreliable, especially at the beginning of the process.
The sampling approach can be used to provide an additional confidence estimate.
As the process progresses, we can compare the two estimates to see whether we can trust the classifier's confidence estimate.
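Put together, the loop described above might look roughly like this; train, confidence and annotate are placeholders for whatever classifier, confidence estimate (single model or aggregate) and human annotation step are actually used, so this is a sketch rather than SEER's pipeline:

    def active_learning_loop(labelled, unlabelled, train, confidence,
                             annotate, batch_size=10, rounds=5):
        """Iteratively train, score the raw corpus, and hand the least
        confident examples to a human annotator."""
        for _ in range(rounds):
            model = train(labelled)                          # train on current labels
            scored = sorted(unlabelled, key=lambda x: confidence(model, x))
            batch, unlabelled = scored[:batch_size], scored[batch_size:]
            labelled += [(x, annotate(x)) for x in batch]    # manual annotation step
        return train(labelled)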
Active Learning in SEER
What entities would benefit from active learning?
Entities that require a large number of features for classification and are therefore harder to learn.
For such entities, classifiers are more likely to disagree or to lack confidence, so the contribution of a human annotator will be most important.
With a large number of features, many instances will be classified even when the training data is small, but with poor precision.
Suggestions: RCAHMS site, artefact
Bootstrapping
Idea: create a model from a small amount of labelled training data and a large amount of unlabelled data.
Classical bootstrapping:
tagging unlabelled data and adding it to the training data to learn a new model (a minimal sketch follows below).
This approach introduces too much noise into the model and can even degrade performance (e.g. Clark et al. 2003, Carreras et al. 2003).
Task:
determine which newly labelled examples are most useful and accurate for re-training.
Co-training has proved successful for a range of NLP tasks.
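A minimal self-training sketch of the classical scheme referred to above; the threshold and helper functions are illustrative assumptions, and the confidently-but-wrongly labelled additions are exactly the noise the point above warns about:

    def self_train(labelled, unlabelled, train, predict_with_confidence,
                   threshold=0.95, rounds=3):
        """Classical bootstrapping: label the raw corpus with the current model
        and move the most confidently labelled examples into the training data.
        Mislabelled but confident examples are the noise that can degrade
        performance."""
        for _ in range(rounds):
            model = train(labelled)
            keep = []
            for x in unlabelled:
                label, conf = predict_with_confidence(model, x)
                if conf >= threshold:
                    labelled.append((x, label))   # possibly noisy additions
                else:
                    keep.append(x)
            unlabelled = keep
        return train(labelled)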
Co-training using Different Features
Train classifiers using different feature sets
(e.g. internal vs. external)
Collins and Singer (1999):
Used spelling and contextual features in parsed
data.
Either type is sufficient due to feature redundancy.
Started off with 7 seed rules and at each iteration
induced either additional context or spelling rules to
label the training data.
Directly maximised an objective function based on
the level of agreement between the classifiers.
Abney (2002) proves that when the feature sets satisfy certain independence conditions, such an approach will indeed yield a classifier with bounded error.
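A generic two-view co-training sketch in the spirit of this setting, with one classifier per feature view (e.g. spelling vs. context); it assumes predict_with_confidence returns a (label, confidence) pair and does not reproduce Collins and Singer's actual rule induction or agreement objective:

    def co_train(seed, unlabelled, view1, view2, train, predict_with_confidence,
                 per_round=5, rounds=10):
        """Two-view co-training sketch: each classifier labels its most
        confident examples for the other view's training data."""
        data1, data2 = list(seed), list(seed)        # both views start from the seeds
        for _ in range(rounds):
            m1 = train([(view1(x), y) for x, y in data1])
            m2 = train([(view2(x), y) for x, y in data2])
            for model, view, target in ((m1, view1, data2), (m2, view2, data1)):
                scored = sorted(((x,) + predict_with_confidence(model, view(x))
                                 for x in unlabelled), key=lambda t: -t[2])
                target += [(x, y) for x, y, _ in scored[:per_round]]
                # a full system would also remove the newly labelled examples
                # from the unlabelled pool
        return m1, m2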
Co-training using Different Classifiers
Clark et al. (2003):
Use two taggers for co-training, each providing a different
(not entirely independent) view of the task
Taggers can learn from each other to improve
performance.
Important: must use classifiers that are sufficiently
different.
Best results using agreement-based co-training:
select the subset of newly labelled examples that
maximises the agreement of the two taggers on
unlabelled data
Naive co-training does similarly well but is less
cost-effective
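The agreement-based selection step can be sketched as follows, assuming a hypothetical retrain_pair helper that retrains both taggers after adding a candidate batch of newly labelled examples (an illustration of the idea, not Clark et al.'s implementation):

    def agreement_rate(tagger_a, tagger_b, unlabelled):
        """Fraction of unlabelled examples on which the two taggers agree."""
        return sum(tagger_a(x) == tagger_b(x) for x in unlabelled) / len(unlabelled)

    def select_by_agreement(candidate_batches, retrain_pair, unlabelled):
        """Choose the batch of newly labelled examples whose addition leads to
        the retrained taggers agreeing most often on the unlabelled data."""
        return max(candidate_batches,
                   key=lambda batch: agreement_rate(*retrain_pair(batch), unlabelled))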
Co-training in SEER
What entities would benefit from co-training?
Entities where we can naturally distinguish two
independent feature sets:
Internal composition and external context
For example: amino acid sequences, protein names.
Entities for which we can easily supply reliable seed
data for co-training
Mostly from gazetteers or compositional structure
For example: RCAHMS terms, sites
Unsupervised Approaches
Look at similarities among entities in the data.
Such approaches rely on the distribution of features and
benefit from a large feature set.
Extract a set of instances as entity candidates,
possibly using NP chunking.
Describe each of them using, e.g. contextual features.
Use a method that not only clusters the instances, but
also does feature selection.
Utilise the features for bootstrapping or active learning.
Clusters may include instances that are not captured by
the other approaches.
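A small sketch of this pipeline using bag-of-context-words features and scikit-learn's KMeans purely for illustration; the candidate instances and contexts are made up, and KMeans does no feature selection, unlike the methods discussed next:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer

    # Candidate instances paired with (illustrative) context words.
    candidates = {
        "unroofed building": "township comprising one depicted",
        "field walls":       "several enclosures head-dyke depicted",
        "Innocent Railway":  "opened 1831 carry coal Dalkeith",
        "freight railways":  "Scotland first opened 1831",
    }

    vectoriser = CountVectorizer()
    X = vectoriser.fit_transform(candidates.values())      # contextual feature matrix
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for term, cluster in zip(candidates, labels):
        print(cluster, term)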
Unsupervised Approaches
Possible methods:
Methods that consider clusters as latent variables and
optimise a target probabilistic model.
PLSA (Hofmann 1999):
A joint probability model for instances and features,
treats instances and features symmetrically.
Information Bottleneck (Pereira, Tishby & Lee 1993):
Clusters the instances, while the cluster centroids express the relevance of features.
Optimises compression vs. abstraction via a tradeoff parameter.
Maximum-Likelihood EM is a special case for a particular value of the tradeoff parameter.
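For reference, the Information Bottleneck objective as usually written (Tishby, Pereira & Bialek 1999): the cluster assignments p(t|x) compress the instances X while preserving information about the relevance variable Y (here, the features), with the tradeoff controlled by β:

    \min_{p(t \mid x)} \; \mathcal{L} = I(T;X) - \beta\, I(T;Y)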
Summary
For the learning tasks currently available, we can use active learning, co-training and unsupervised methods for different entity types.
The experience gathered will be used for new tasks.
We plan to develop a methodological approach for matching the most suitable learning method to a given entity or entity set.