Background to SEER Proposal

• TTT (Text Tokenisation Tool)
  – output was LT TTT, an XML-based toolset for low-level linguistic analysis and annotation
  – used as basis of:
• Participation in MUC7 named entity task
  – mostly rule-based approach
  – but incremental approach where max ent classifier was interleaved with rule-based annotation

Stanford Link Team, May 8, 2003

Other Projects

• DISP (Data Intensive Semantics & Pragmatics)
  – corpus of Medline abstracts
  – annotation of domain-specific entities and terms prior to parsing
• Mascara
  – resolution of metonymy (e.g. place for organisation)
  – named entity recognition a prerequisite

Other Projects

• CROSSMARC
  – Multilingual IE from web pages
  – laptop computer and job offer domains: non-standard set of entities
• EDIFY contract
  – Recognition of person names and addresses in emails

Different Domains, Different Entities

• MUC
• Web pages
  – Laptops
  – Job Ads
• Email
• Biomedical
  – DISP
  – BioCreative
• Legal

Machine Learning vs Rule-Based

• Rule-based systems tend to perform better but incur high labour costs.
• Machine learning approaches are becoming competitive.
  – cf. CoNLL 2002, 2003 shared task (Clark and Curran, Klein, Smarr, Nguyen and Manning).
• Machine learning still requires significant amounts of training material: labour costs still high and adaptation to new domains still slow.

Goals of the SEER Project

• To develop the means to recognise a wider variety of entities than before
  – including ones not signalled by capitalisation.
• To experiment to find the most useful machine learning techniques.
• To investigate boot-strapping techniques in order to minimise the amount of training data needed.

Where We Are Now

• wider variety of entities:
  – Chosen two very different domains, biomedical (genes, proteins etc) and architectural/archaeological.
• experiment with machine learning techniques:
  – max ent tagging using C&C and Stanford tagger
• investigate boot-strapping techniques to minimise the amount of training data needed:
  – STARTS TODAY!

MUC Named Entity Recognition

Helen Weir, the finance director of Kingfisher, was handed a £334,607 allowance last year to cover the costs of a relocation that appears to have shortened her commute by around 15 miles. The payment to the 40-year-old amounts to roughly £23,000 a mile to allow her to move from Hampshire to Buckinghamshire after an internal promotion.
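
To make the example passage concrete, here is a minimal sketch of how some of its entities could be marked up with MUC-style ENAMEX and NUMEX tags. It is an illustration only, not the annotation method used in SEER or in the MUC7 system: the gazetteer entries, the `annotate` function and the money pattern are all hypothetical, and a real system would rely on much larger resources or a trained classifier.

```python
import re

# Hypothetical toy gazetteer (an assumption for illustration only).
GAZETTEER = {
    "Helen Weir": ("ENAMEX", "PERSON"),
    "Kingfisher": ("ENAMEX", "ORGANIZATION"),
    "Hampshire": ("ENAMEX", "LOCATION"),
    "Buckinghamshire": ("ENAMEX", "LOCATION"),
}

# Sterling amounts such as "£334,607" or "£23,000".
MONEY = r"£\d+(?:,\d{3})*(?:\.\d+)?"


def annotate(text):
    """Wrap gazetteer names and money amounts in MUC-style tags."""
    # Longest names first, so a longer entry wins over any shorter overlap.
    names = sorted(GAZETTEER, key=len, reverse=True)
    alternatives = [MONEY] + [r"\b" + re.escape(n) + r"\b" for n in names]
    pattern = re.compile("|".join(alternatives))

    def wrap(match):
        span = match.group(0)
        # Anything that is not a gazetteer name was matched by the money rule.
        element, etype = GAZETTEER.get(span, ("NUMEX", "MONEY"))
        return f'<{element} TYPE="{etype}">{span}</{element}>'

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(annotate("Helen Weir, the finance director of Kingfisher, "
                   "was handed a £334,607 allowance"))
```

A hand-written lookup like this only finds names it already knows, which is one reason rule-based systems carry the labour costs mentioned earlier when moved to a new domain.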