Background to SEER Proposal

• TTT (Text Tokenisation Tool)
– output was LT TTT, an XML-based toolset
for low level linguistic analysis and
annotation
– used as basis of:
• Participation in MUC7 named entity task
– mostly rule-based approach
– but incremental, with a max ent classifier
interleaved with rule-based annotation
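The interleaving described above can be pictured as a sequence of annotation passes over the same token labels. The following is only a minimal sketch under stated assumptions: the names `rule_pass`, `classifier_pass`, `TITLES` and the fixed scores are hypothetical, and a hard-coded scoring function stands in for a trained max ent model. It is not the LT TTT implementation.

```python
# Hypothetical illustration of interleaving rule-based and statistical passes.
TITLES = {"Mr", "Mrs", "Ms", "Dr"}  # assumed toy list of person titles


def rule_pass(tokens, labels):
    """High-precision rule: a capitalised token right after a title is a PERSON."""
    for i in range(1, len(tokens)):
        if tokens[i - 1] in TITLES and tokens[i][0].isupper():
            labels[i] = "PERSON"
    return labels


def classifier_pass(tokens, labels, threshold=0.5):
    """Stand-in for a max ent classifier: decide whether an unlabelled token
    that repeats an already labelled token should receive the same label."""
    known = {tok: lab for tok, lab in zip(tokens, labels) if lab != "O"}
    for i, tok in enumerate(tokens):
        if labels[i] == "O" and tok in known:
            # Assumed fixed scores in place of a learned probability.
            score = 0.9 if tok[0].isupper() else 0.1
            if score >= threshold:
                labels[i] = known[tok]
    return labels


def annotate(tokens):
    """Run the passes in turn; each pass sees the labels left by the previous one."""
    labels = ["O"] * len(tokens)
    for stage in (rule_pass, classifier_pass, rule_pass):
        labels = stage(tokens, labels)
    return labels


tokens = "Dr Weir spoke . Later Weir left .".split()
print(list(zip(tokens, annotate(tokens))))
```

The point of the design is that later passes can rely on evidence left by earlier ones: the second mention of "Weir" has no title, but the classifier stage can reuse the label the rule stage assigned to the first mention.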
Stanford Link Team, May 8, 2003
Other Projects
• DISP (Data Intensive Semantics &
Pragmatics)
– corpus of Medline abstracts
– annotation of domain specific entities and
terms prior to parsing
• Mascara
– resolution of metonymy (e.g. place for
organisation)
– named entity recognition a prerequisite
Other Projects (continued)
• CROSSMARC
– Multilingual IE from web pages
– laptop computer and job offer
domains: non-standard set of entities
• EDIFY contract
– Recognition of person names and
addresses in emails
Different Domains, Different Entities
• MUC
• Web pages
– Laptops
– Job Ads
• Email
• Biomedical
– DISP
– BioCreative
• Legal
Machine Learning vs Rule-Based
• Rule-based systems tend to perform better
but have high labour costs.
• Machine learning approaches are becoming
competitive.
– cf. CoNLL 2002 and 2003 shared tasks (Clark and
Curran; Klein, Smarr, Nguyen and Manning).
• Machine learning still requires significant
amounts of training material: labour costs still
high and adaptation to new domains still slow.
Goals of the SEER Project
• To develop the means to recognise a wider
variety of entities than before – including
ones not signalled by capitalisation.
• To experiment to find the most useful
machine learning techniques.
• To investigate boot-strapping techniques in
order to minimise the amount of training data
needed.
Where We Are Now
• wider variety of entities:
– Chose two very different domains: biomedical
(genes, proteins, etc.) and
architectural/archaeological.
• experiment with machine learning techniques:
– max ent tagging using the C&C and Stanford taggers
• investigate boot-strapping techniques to
minimise the amount of training data needed:
– STARTS TODAY!
MUC Named Entity Recognition
Helen Weir, the finance director of
Kingfisher, was handed a £334,607
allowance last year to cover the costs of a
relocation that appears to have shortened her
commute by around 15 miles. The payment to
the 40-year-old amounts to roughly £23,000 a
mile to allow her to move from Hampshire to
Buckinghamshire after an internal promotion.
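The named entities in the passage above can be marked up with MUC-style SGML tags (ENAMEX for names, NUMEX for numeric expressions). The following is a minimal sketch, assuming a hypothetical hand-written gazetteer (`GAZETTEER`) and a single money pattern (`MONEY`). It covers only this one passage, leaves temporal expressions untagged, and is not the LT TTT implementation.

```python
import re

TEXT = (
    "Helen Weir, the finance director of Kingfisher, was handed a "
    "£334,607 allowance last year to cover the costs of a relocation "
    "that appears to have shortened her commute by around 15 miles. "
    "The payment to the 40-year-old amounts to roughly £23,000 a mile "
    "to allow her to move from Hampshire to Buckinghamshire after an "
    "internal promotion."
)

# Hypothetical toy gazetteer: surface string -> (MUC element, TYPE attribute).
GAZETTEER = {
    "Helen Weir": ("ENAMEX", "PERSON"),
    "Kingfisher": ("ENAMEX", "ORGANIZATION"),
    "Hampshire": ("ENAMEX", "LOCATION"),
    "Buckinghamshire": ("ENAMEX", "LOCATION"),
}

# Assumed pattern for sterling amounts such as "£334,607".
MONEY = re.compile(r"£\d{1,3}(?:,\d{3})*")


def annotate(text):
    """Return text with gazetteer names and money amounts wrapped in MUC-style tags."""
    spans = [(m.start(), m.end(), "NUMEX", "MONEY") for m in MONEY.finditer(text)]
    for name, (element, entity_type) in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((m.start(), m.end(), element, entity_type))
    spans.sort()  # the spans in this example never overlap

    pieces, position = [], 0
    for start, end, element, entity_type in spans:
        pieces.append(text[position:start])
        pieces.append(f'<{element} TYPE="{entity_type}">{text[start:end]}</{element}>')
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


print(annotate(TEXT))
```

A gazetteer lookup like this only finds names it already knows, which is one reason hand-built resources are costly to adapt to new domains.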