Donghee Sung

Research Paper Presentation – CS572 Summer 2011
Paper by Paul Clough
(University of Sheffield Western Bank)
Presented by Donghee Sung

SPIRIT:
Spatial awareness to information systems
e.g.
transport timetables
routing system for motorists
map-based web sites
location based services
Key Part:
Extraction and use of geospatial information

Criteria
Speed, Reliability, Flexibility, Multilingualism

Geo-Parsing:
- Identifying geographic references
- Gazetteer lookup with context rules to filter out
common-usage words and personal names
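
The geo-parsing step above can be sketched roughly as follows. This is a minimal illustration, not the SPIRIT implementation: the gazetteer, the list of common-usage words, the preposition list, and the function name `geo_parse` are all hypothetical, and the single context rule (keep an ambiguous name only when it is capitalised and follows a spatial preposition) is an assumption chosen for brevity.

```python
import re

# Hypothetical toy gazetteer; a real system would load entries from a
# resource such as TGN, OS or SABE.
GAZETTEER = {"sheffield", "reading", "bath", "london"}
# Hypothetical list of place names that are also common-usage words.
COMMON_USAGE = {"reading", "bath"}
# Hypothetical cue words for the context rule.
SPATIAL_PREPOSITIONS = {"in", "near", "at", "from", "to"}


def geo_parse(text):
    """Return candidate place names found by gazetteer lookup."""
    tokens = re.findall(r"[A-Za-z]+", text)
    found = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if key not in GAZETTEER:
            continue
        if key in COMMON_USAGE:
            # Assumed context rule: keep an ambiguous name only when it is
            # capitalised and preceded by a spatial preposition.
            previous = tokens[i - 1].lower() if i > 0 else ""
            if not (token[0].isupper() and previous in SPATIAL_PREPOSITIONS):
                continue
        found.append(token)
    return found


print(geo_parse("I was reading a book in Bath near Sheffield"))
```

Here "reading" is rejected as a common-usage word, while "Bath" passes the context rule and "Sheffield" is unambiguous in this toy gazetteer.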

Geo-Coding:
- Assigning spatial coordinates
- Based on information from a geographic resource
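
As a rough sketch of geo-coding, the names found by geo-parsing can be looked up in a geographic resource that stores coordinates. The `COORDINATES` table and the function name `geo_code` are hypothetical, and the coordinates are approximate values entered by hand for illustration.

```python
# Hypothetical gazetteer extract: name -> (latitude, longitude) in decimal
# degrees. Values are approximate and only for illustration.
COORDINATES = {
    "sheffield": (53.38, -1.47),
    "london": (51.51, -0.13),
}


def geo_code(place_names):
    """Attach coordinates to each place name that the resource knows."""
    coded = {}
    for name in place_names:
        coordinate = COORDINATES.get(name.lower())
        if coordinate is not None:
            coded[name] = coordinate
    return coded


print(geo_code(["Sheffield", "Atlantis"]))
```

Names missing from the resource are simply left uncoded here; a fuller system would also need to disambiguate names that match several places.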
< http://www.geo-spirit.org/ >

SPIRIT
SPatially-Aware Information Retrieval
on the InterneT
A search engine
to find documents and datasets on the web
relating to places or regions

Existing web search facilities are poor at
finding information related to a particular location.
Vicinity: find other places within radius
www.somewherenear.com
Yellow pages services:
find a specific place or post code
Buyukkokten:
associated admin’s IP with telephone area code
Stanford Research Institute:
proposed a ‘.geo’ domain with cells defined by latitude and longitude

Resources relating to a place
may not be found
may refer to places nearby
may use another name

Major Shortcoming:
cannot recognize alternative names
modern/historical variants
informal names
names of contained places

SPIRIT Project
Query expansion / relevance ranking procedures
Machine learning techniques
extraction of geographical context
generating metadata
Multi-modal user interface
textual input
interactive map feedback
Spatial indices for web collections.

Sources of Spatial Data
TGN, OS, SABE

A large web collection of SPIRIT

Tokenization Issues
Stop-words
Named-Entity Recognition (NER)
Gazetteers

Named-Entity Recognition (NER)
Processing a text and identifying references to particular
categories of Named Entities (NE)
People, Organization, Location, etc.

Tokenization Procedure
1) Tokenized on whitespace
@words = split(/\s+/, $sentence);
(Perl Regular Expressions)
"Isn't it ashame." -> Isn't / it / ashame.
2) Stemming / Case conversion.
isn't / it / asham
3) Removing stop-words
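
The three steps above can be sketched in Python as follows. This is an illustrative sketch only: the stop-word list, the suffix list, and the function names are hypothetical, and the crude suffix stripper stands in for whichever stemmer the system actually used.

```python
# Hypothetical stop-word list and suffix list, for illustration only.
STOP_WORDS = {"it", "is", "the", "a"}
SUFFIXES = ("ing", "er", "s", "e")


def tokenize(sentence):
    """Step 1: split on runs of whitespace."""
    return sentence.split()


def normalise(token):
    """Step 2: lower-case, trim punctuation, strip one known suffix."""
    word = token.lower().strip(".,;:!?\"")
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence, remove_stop_words=True):
    """Run steps 1 and 2, then optionally step 3 (stop-word removal)."""
    stems = [normalise(token) for token in tokenize(sentence)]
    if remove_stop_words:
        stems = [stem for stem in stems if stem not in STOP_WORDS]
    return stems


print(tokenize("Isn't it ashame."))
print(preprocess("Isn't it ashame.", remove_stop_words=False))
print(preprocess("Isn't it ashame."))
```

With this toy stemmer, "London" and "Londoner" both normalise to "london", which is the kind of conflation stemming is meant to provide.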

Default setting in indexing and retrieving
- Case sensitivity: Off
- Stop-word removal: Off
- Stemming: Off
Stop-word removal / stemming
-> Reduce the size of index files
But they can be useful:
Stop-words: ‘in’, ‘inside’, or ‘of’
Stemming: “London” from “London” & “Londoner”.

Filtering candidate locations
using context rules to remove
stop-words
references to people and organizations,
and links to emails/URLs
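
A minimal sketch of this kind of filtering is shown below, assuming single-word candidates. The cue lists, the link pattern, and the function name `filter_candidates` are hypothetical; the actual SPIRIT rules are not given on the slide.

```python
import re

# Hypothetical cue lists, for illustration only.
STOP_WORDS = {"of", "and", "the"}
PERSON_TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ORGANISATION_CUES = {"university", "council", "ltd", "bank"}
# Rough pattern for e-mail addresses and URLs.
LINK_PATTERN = re.compile(r"\S+@\S+|https?://\S+|www\.\S+")


def filter_candidates(text, candidates):
    """Drop candidates that the context rules mark as non-locations."""
    # Remove e-mail addresses and URLs so names inside them are ignored.
    text = LINK_PATTERN.sub(" ", text)
    tokens = re.findall(r"[A-Za-z]+", text)
    kept = []
    for candidate in candidates:
        if candidate.lower() in STOP_WORDS:
            continue
        for i, token in enumerate(tokens):
            if token != candidate:
                continue
            before = tokens[i - 1].lower() if i > 0 else ""
            after = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            # A title before the name suggests a person; an organisation
            # cue after it suggests an organisation.
            if before in PERSON_TITLES or after in ORGANISATION_CUES:
                continue
            kept.append(candidate)
            break
    return kept


text = ("Dr Lincoln of Sheffield University visited York, "
        "see www.york.example or mail info@york.example")
print(filter_candidates(text, ["Lincoln", "Sheffield", "York", "of"]))
```

A candidate is kept as soon as one of its occurrences passes every rule, so a name used once for a person and once for a place would still survive.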

Geo-Parsing method could be improved
by enhancing the gazetteer matching and filtering

False hits would be reduced
by generating a better list of stop-words and using
further context rules

Need for creating rules would be alleviated
by learning further context rules from features using
machine learning
[3]
Jones C.B., R. Purves, A. Ruas, M. Sanderson, M. Sester, M.J.
van Kreveld, R. Weibel (2002). Spatial information retrieval and
geographical ontologies: an overview of the SPIRIT project.
In SIGIR'02, Tampere, Finland, 387-388.
[6]
Joho, H. and Sanderson, M. (2004) The SPIRIT collection: an
overview of a large web collection. In SIGIR Forum, 38(2), 57-61.
[8]
Mikheev A., Moens M. and Grover C. (1999) Named Entity
recognition without gazetteers. In Proceedings of the Annual
Meeting of the European Association for Computational
Linguistics EACL'99, Bergen, Norway, 1-8.