Issues in NERC of Locations M Nissim and J Leidner 2004–01–27

Issues for Toponym Recognition and Classification
• Debate: knowledge-lean or knowledge-intensive NE recognition for
locations?
– Unlike person names, huge semi-exhaustive gazetteers available with
world-wide coverage
• Features: co-use of multiple gazetteers, non-binary features?
– What about selecting sub-gazetteers dynamically on demand? (if broad
area is Scotland ⇒ switch to Scottish gazetteer + backoff gazetteer)
• Prediction of fine-grained NE types from coarse-grained NE types to
circumvent lack of training data
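The idea of switching to a regional sub-gazetteer with a backoff gazetteer can be illustrated with a minimal sketch. The gazetteer contents, the type label MOUNTAIN, and the names `SUB_GAZETTEERS`, `WORLD_GAZETTEER` and `lookup` are assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical toy gazetteers; real ones would be loaded from external resources.
SCOTTISH_GAZETTEER = {"Edinburgh": "CITY", "Perth": "CITY", "Ben Nevis": "MOUNTAIN"}
WORLD_GAZETTEER = {"Perth": "CITY", "Reading": "CITY", "Edinburgh": "CITY"}

# Maps a broad area to its regional sub-gazetteer (assumed structure).
SUB_GAZETTEERS = {"Scotland": SCOTTISH_GAZETTEER}


def lookup(name, broad_area=None):
    """Return (type, source) for name, trying the regional gazetteer first."""
    regional = SUB_GAZETTEERS.get(broad_area)
    if regional is not None and name in regional:
        return regional[name], broad_area
    # Back off to the world-wide gazetteer when the regional one has no entry.
    if name in WORLD_GAZETTEER:
        return WORLD_GAZETTEER[name], "backoff"
    return None, None
```

Returning the source alongside the type would let a learner use "found in regional gazetteer" versus "found only in backoff gazetteer" as a non-binary feature.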
‘Grand Challenges’ in NERC
• Scalability: lack of training material → discussed elsewhere
• Granularity (deeper look)
• Structure
• Noisy input
• Taxonomies
• Criteria
• Linking
Recurring Themes in NERC
• Granularity
– How fine-grained should we define our NE classes?
DOG BREED, CAR BRAND versus ARTIFACT
– Flat or hierarchical? What is the learnability impact resulting from
particular choices?
POPULATED AREA versus ENAMEX:LOCATION:POPULATED AREA
– How many instances per class do we need for training? How can we
reduce the number?
– How can we predict sub-classes, given broad top-level class labels?
∗ Web validation (Magnini et al. 2002)
∗ check head noun against WordNet
∗ Evaluation: recoverability? → proposed MSc project
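One way to compare flat and hierarchical label sets is to score the same predictions at several depths of the hierarchy. The sketch below assumes colon-separated label paths such as ENAMEX:LOCATION:POPULATED AREA; the function names `coarsen` and `accuracy_at_depth` are hypothetical, and this is not the evaluation proposed on the slide.

```python
def coarsen(label, depth):
    """Truncate a colon-separated hierarchical label to at most `depth` levels."""
    return ":".join(label.split(":")[:depth])


def accuracy_at_depth(gold, predicted, depth):
    """Fraction of items whose labels agree once both are truncated to `depth`."""
    hits = sum(coarsen(g, depth) == coarsen(p, depth)
               for g, p in zip(gold, predicted))
    return hits / len(gold)
```

A prediction that is wrong at the finest level may still be correct at a coarser one, which gives a rough picture of how much accuracy each extra level of granularity costs.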
• Structure
– NEs can be composite
– Internal structure is interesting and helps identify them
– Examples: Mt Vesuvius, St. Martin-in-the-Fields, Dawlat al-Qatar
– Can use methods from statistical parsing
[Figure: ontology with nodes NE, LOCATION, LOC-ARTIFACT, LOCORG-ARTIFACT and CITY, alongside the examples below]
– Class prior: P(CITY)
– Tagged examples:
∗ Reading , MA: CITY , US-STATE
∗ Peterhouse College , Cambridge: LOC-ARTIFACT ORG-CUE , CITY
– Classification: prior P(TYPE=CITY | WORD="Reading")
– Resolution: P(<x, y> | TYPE=CITY, WORD="Reading"), plus other evidence
∗ Reading, MA, USA: 42:32 N, 71:06 W
∗ Reading, England, UK: 51:28 N, 0:59 W
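The two steps for "Reading", classification via P(TYPE=CITY | WORD="Reading") and resolution to coordinates using other evidence, might be sketched as below. The counts are invented for illustration, and the overlap-based scoring in `resolve` is a simple stand-in, not the authors' model; only the two coordinate pairs come from the slide.

```python
from collections import Counter

# Hypothetical annotated counts of (word, type) pairs; not from the slides.
COUNTS = Counter({("Reading", "CITY"): 3, ("Reading", "OTHER"): 7})


def type_given_word(word, ne_type):
    """Relative-frequency estimate of P(TYPE=ne_type | WORD=word)."""
    total = sum(c for (w, _), c in COUNTS.items() if w == word)
    return COUNTS[(word, ne_type)] / total if total else 0.0


# Candidate referents with the coordinates given on the slide
# (degrees:minutes converted to signed decimal degrees; west is negative).
CANDIDATES = {
    "Reading": [
        {"context": {"MA", "USA"}, "lat": 42 + 32 / 60, "lon": -(71 + 6 / 60)},
        {"context": {"England", "UK"}, "lat": 51 + 28 / 60, "lon": -(0 + 59 / 60)},
    ]
}


def resolve(word, evidence):
    """Pick the candidate whose context overlaps most with the other evidence."""
    candidates = CANDIDATES.get(word, [])
    if not candidates:
        return None
    best = max(candidates, key=lambda c: len(c["context"] & set(evidence)))
    return round(best["lat"], 2), round(best["lon"], 2)
```

For instance, `resolve("Reading", ["MA"])` selects the Massachusetts referent, while `resolve("Reading", ["UK"])` selects the English one.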
• Learning from noisy input
– Gazetteers are often messy or automatically extracted (non-names in
RCAHMS; protein lists for retrieval in the bio-genetic domain)
– Requires development of more advanced approximate matching
techniques: Levenshtein distance; Dice coefficient (1945)
• Taxonomies for NERC
– We want more fine-grained NEs, as specific as can be (Sekine 2002)
– Relate NEs across languages (Collier et al. 2002 annotate with RDF
pointers)
– But: which relation? Hyponymy versus meronymy/holonymy! (Smith
and Rosse 2004)
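The two approximate matching measures named under "Learning from noisy input" can be sketched in a few lines. This is a minimal illustration: the Dice coefficient is computed here over sets of character bigrams, which is one common choice and an assumption on our part, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def bigrams(s):
    """Set of character bigrams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over character-bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))
```

A gazetteer matcher could accept a candidate when the edit distance is below a small threshold or the Dice score is above one; suitable thresholds would have to be tuned on the data.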
• Criteria for defining sets of classes
– Guiding design principles needed, such as the theory of ‘granular
partitions’ (Kumar and Smith 2003)
• Linking NEs to external resources
– Automatic hyperlinking (e.g. images of Scottish sites, product
descriptions)
– Co-reference resolution of NEs, e.g. Edinburgh–Embra (Uryupina,
forthcoming)
– Location name ↦ grid reference (Leidner 2003)