Issues in NERC of Locations
M Nissim and J Leidner
2004-01-27

Issues for Toponym Recognition and Classification

• Debate: knowledge-lean or knowledge-intensive NE recognition of locations?
  – Unlike person names, huge semi-exhaustive gazetteers are available with world-wide coverage
• Features: co-use of multiple gazetteers, non-binary features?
  – What about selecting sub-gazetteers dynamically on demand? (if the broad area is Scotland ⇒ switch to Scottish gazetteer + backoff gazetteer)
• Prediction of fine-grained NE types from coarse-grained NE types to circumvent the lack of training data

‘Grand Challenges’ in NERC

• Scalability: lack of training material → discussed elsewhere
• Granularity (deeper look)
• Structure
• Noisy input
• Taxonomies
• Criteria
• Linking

Recurring Themes in NERC

• Granularity
  – How fine-grained should we define our NE classes? DOG BREED, CAR BRAND versus ARTIFACT
  – Flat or hierarchical? What is the learnability impact resulting from particular choices? POPULATED AREA versus ENAMEX:LOCATION:POPULATED AREA
  – How many instances per class do we need for training? How can we reduce the number?
  – How can we predict sub-classes, given broad top-level class labels?
    ∗ Web validation (Magnini et al. 2002)
    ∗ Check head noun against WordNet
    ∗ Evaluation: recoverability? → proposed MSc project
• Structure
  – NEs can be composite
  – Internal structure is interesting, helps identify them
  – Examples: Mt Vesuvius, St. Martin-in-the-Fields, Dawlat alQatar
  – Can use methods from statistical parsing

  [Figure: internal structure of composite NEs, labelled with classes from an NE ontology (LOCATION, LOC-ARTIFACT, LOCORG-ARTIFACT, CITY, US-STATE, ORG-CUE)]
    "Reading , MA": CITY , US-STATE
    "Peterhouse College , Cambridge": LOC-ARTIFACT ORG-CUE , CITY
    Class prior: P(CITY)
    Classification: prior P(TYPE=CITY | WORD="Reading")
    Resolution (other evidence): P(<x, y> | TYPE=CITY, WORD="Reading")
      Reading, MA, USA      42:32 N 71:06 W
      Reading, England, UK  51:28 N 0:59 W

• Learning from noisy input
  – Gazetteers are often messy or automatically extracted (non-names in RCAHMS; protein lists in retrieval in the bio-genetic domain)
  – Requires development of more advanced approximate matching techniques: Levenshtein distance; Dice coefficient (1945)
• Taxonomies for NERC
  – We want more fine-grained NEs, as specific as can be (Sekine 2002)
  – Relate NEs across languages (Collier et al. 2002 annotate with RDF pointers)
  – But: which relation? Hyponymy versus meronymy/holonymy! (Smith and Rosse 2004)
• Criteria for defining sets of classes
  – Guiding design principles needed, such as the theory of ‘granular partitions’ (Kumar and Smith 2003)
• Linking NEs to external resources
  – Automatic hyperlinking (e.g. images of Scottish sites, product descriptions)
  – Co-reference resolution of NEs, e.g. Edinburgh–Embra (Uryupina, forthcoming)
  – Location name ↦ grid reference (Leidner 2003)
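
The idea of selecting a sub-gazetteer on demand, with a backoff gazetteer behind it, could be sketched as follows. This is a minimal illustration only: the gazetteer contents, the `REGIONAL_GAZETTEERS` table, and the `lookup` function are hypothetical and not taken from the slides.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy gazetteers (name -> feature type); entries are illustrative only.
SCOTTISH_GAZETTEER: Dict[str, str] = {
    "edinburgh": "CITY",
    "perth": "CITY",
    "ben nevis": "MOUNTAIN",
}
WORLD_GAZETTEER: Dict[str, str] = {
    "perth": "CITY",
    "reading": "CITY",
    "edinburgh": "CITY",
}

# Assumed mapping from a broad area to its specialised sub-gazetteer.
REGIONAL_GAZETTEERS: Dict[str, Dict[str, str]] = {"Scotland": SCOTTISH_GAZETTEER}


def lookup(name: str, broad_area: Optional[str] = None) -> Optional[Tuple[str, str]]:
    """Return (feature type, source) for a name, or None if it is unknown.

    The regional gazetteer for `broad_area` is consulted first; the
    world-wide gazetteer serves as the backoff.
    """
    key = name.strip().lower()
    regional = REGIONAL_GAZETTEERS.get(broad_area) if broad_area else None
    if regional is not None and key in regional:
        return regional[key], broad_area
    if key in WORLD_GAZETTEER:
        return WORLD_GAZETTEER[key], "backoff"
    return None


if __name__ == "__main__":
    print(lookup("Ben Nevis", broad_area="Scotland"))
    print(lookup("Reading", broad_area="Scotland"))
```

Returning the source alongside the type lets a classifier use "found in the regional gazetteer" and "found only in the backoff" as separate features, which is one way to read the slide's question about non-binary features.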
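
The two approximate matching measures named under noisy input can be sketched in a few lines. The Levenshtein function below is the standard dynamic-programming edit distance. For the Dice coefficient, computing it over sets of character bigrams is an assumption made here (a common choice for string matching); the slides do not say which units are compared.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def char_bigrams(s: str) -> set:
    """Set of lower-cased character bigrams (assumed unit of comparison)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over character-bigram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0  # two strings without bigrams are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))


if __name__ == "__main__":
    print(levenshtein("Edinburgh", "Edinburg"))
    print(dice("Edinburgh", "Edinburg"))
```

A gazetteer lookup could accept a candidate when the edit distance falls below, or the Dice score rises above, a threshold tuned on held-out data; the threshold itself is not something the slides specify.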
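
The resolution step for "Reading" chooses between two candidate referents using other evidence. The sketch below uses the coordinates from the figure and a deliberately simple stand-in for that evidence: it picks the candidate nearest to a context point, for example the location of another toponym already resolved in the same text. The nearest-candidate heuristic, the `resolve` function, and the context points are assumptions for illustration, not the method described in the slides.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def dm_to_decimal(degrees_minutes: str, hemisphere: str) -> float:
    """Convert 'DD:MM' plus a hemisphere letter (N/S/E/W) to signed decimal degrees."""
    degrees, minutes = degrees_minutes.split(":")
    value = int(degrees) + int(minutes) / 60
    return -value if hemisphere in ("S", "W") else value


# Candidate referents for "Reading", with coordinates as given in the figure.
CANDIDATES: Dict[str, Point] = {
    "Reading, MA, USA": (dm_to_decimal("42:32", "N"), dm_to_decimal("71:06", "W")),
    "Reading, England, UK": (dm_to_decimal("51:28", "N"), dm_to_decimal("0:59", "W")),
}


def haversine_km(p: Point, q: Point) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve(candidates: Dict[str, Point], context_point: Point) -> str:
    """Pick the candidate closest to the context point (assumed heuristic)."""
    return min(candidates, key=lambda name: haversine_km(candidates[name], context_point))


if __name__ == "__main__":
    near_boston = (42.36, -71.06)  # approximate, hypothetical context evidence
    near_london = (51.51, -0.13)   # approximate, hypothetical context evidence
    print(resolve(CANDIDATES, near_boston))
    print(resolve(CANDIDATES, near_london))
```

In a probabilistic setting the distance could instead feed an estimate of P(<x, y> | TYPE=CITY, WORD="Reading"), combined with the classification prior P(TYPE=CITY | WORD="Reading").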