Research Paper Presentation – CS572 Summer 2011
Paper by Paul Clough (University of Sheffield, Western Bank)
Presented by Donghee Sung

SPIRIT: Spatial awareness to information systems
- e.g. transport timetables, routing systems for motorists, map-based web sites, location-based services
Key Part: Extraction and use of geospatial information
Criteria: Speed, Reliability, Flexibility, Multilingualism

Geo-Parsing:
- Identifying geographic references
- Gazetteer lookup with context rules to filter out common-usage words and personal names
Geo-Coding:
- Assigning spatial coordinates
- Based on information from a geographic resource

SPIRIT <http://www.geo-spirit.org/>
SPatially-Aware Information Retrieval on the InterneT
A search engine to find documents and datasets on the web relating to places or regions

Existing web search facilities are poor at finding information related to a particular location.
- Vicinity: find other places within a radius (www.somewherenear.com)
- Yellow pages services: find a specific place or post code
- Buyukkokten: associated a site administrator's IP address with a telephone area code
- Stanford Research Institute: proposed a '.geo' domain with cells defined by latitude and longitude

Resources relating to a place
- may not be found
- may not be places nearby
- may have another name
Major Shortcoming: cannot recognize
- alternative names
- modern/historical variants
- informal names
- contained place names

SPIRIT Project
- Query expansion / relevance ranking procedures
- Machine learning techniques: extraction of geographical context, generating metadata
- Multi-modal user interface: textual input, interactive map feedback
- Spatial indices for web collections

Sources of Spatial Data
- TGN, OS, SABE
- A large web collection of SPIRIT

Tokenization Issues
- Stop-words
- Named-Entity Recognition (NER)
- Gazetteers

Named-Entity Recognition (NER)
Processing a text and identifying particular categories of Named Entities (NE): People, Organization, Location, etc.

Tokenization Procedure
1) Tokenize on whitespace (Perl regular expressions)
   @words = split(/\s+/, $sentence);
   "Isn't it ashame." -> Isn't / it / ashame.
2) Stemming / case conversion
   isn't / it / asham
3) Removing stop-words

Default settings in indexing and retrieving
- Case sensitivity: Off
- Stop-word removal: Off
- Stemming: Off
Stop-word removal / stemming -> reduce the size of index files
But both can be useful:
- Stop-words: 'in', 'inside', or 'of'
- Stemming: "London" from "London" and "Londoner"

Filtering candidate locations using context rules to remove stop-words, references to people and organizations, and links to emails/URLs

- The geo-parsing method could be improved by enhancing the gazetteer matching and filtering.
- False hits would be reduced by generating a better list of stop-words and using further context rules.
- The need for creating rules by hand would be alleviated by using machine learning over features to generate further context rules.

[3] Jones, C.B., Purves, R., Ruas, A., Sanderson, M., Sester, M., van Kreveld, M.J. and Weibel, R. (2002) Spatial information retrieval and geographical ontologies: an overview of the SPIRIT project. In SIGIR'02, Tampere, Finland, 387-388.
[6] Joho, H. and Sanderson, M. (2004) The SPIRIT collection: an overview of a large web collection. In SIGIR Forum, 38(2), 57-61.
[8] Mikheev, A., Moens, M. and Grover, C. (1999) Named Entity recognition without gazetteers. In Proceedings of the Annual Meeting of the European Association for Computational Linguistics (EACL'99), Bergen, Norway, 1-8.

Spatially-Aware Information Retrieval on the Internet - A Working Searching System
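
Illustrative sketch: geo-parsing and geo-coding
The tokenization, gazetteer lookup, and context-rule filtering described in the slides can be sketched roughly as follows. This is a minimal Python illustration, not the SPIRIT implementation. The three-entry gazetteer, its approximate coordinates, the list of personal titles, and the capitalisation rule for common-usage words are all assumptions made for the example; the function names geo_parse and geo_code are hypothetical.

```python
# Toy gazetteer (assumption): place name -> approximate (latitude, longitude).
# A real system would draw on resources such as TGN, OS and SABE.
GAZETTEER = {
    "sheffield": (53.38, -1.47),
    "london": (51.51, -0.13),
    "reading": (51.45, -0.97),
}

# Assumed context cue: a personal title before a name suggests a person.
PERSON_TITLES = {"mr", "mrs", "ms", "dr", "prof"}

PUNCTUATION = ".,;:!?\"'()"


def tokenize(sentence):
    """Split on runs of whitespace, similar to Perl's split(/\\s+/, $sentence)."""
    return sentence.split()


def geo_parse(sentence):
    """Return tokens that look like place references after context filtering."""
    tokens = tokenize(sentence)
    hits = []
    for i, token in enumerate(tokens):
        # Context rule: ignore e-mail addresses and URLs.
        if "@" in token or token.lower().startswith(("http://", "https://", "www.")):
            continue
        word = token.strip(PUNCTUATION)
        if word.lower() not in GAZETTEER:
            continue
        # Assumed rule for common-usage words: require a capital letter,
        # so "reading" (the verb) is not matched as Reading (the town).
        if not word[0].isupper():
            continue
        # Context rule: a preceding personal title marks a person name.
        previous = tokens[i - 1].strip(PUNCTUATION).lower() if i > 0 else ""
        if previous in PERSON_TITLES:
            continue
        hits.append(word)
    return hits


def geo_code(places):
    """Assign coordinates to parsed place names from the gazetteer."""
    return {place: GAZETTEER[place.lower()] for place in places}


if __name__ == "__main__":
    text = "Dr London was reading in Sheffield, see www.london.example"
    places = geo_parse(text)
    print(places)
    print(geo_code(places))
```

In the example sentence, "London" is dropped because it follows a personal title, "reading" because it is lower-case, and the URL because of the link rule, so only "Sheffield" is passed on to geo-coding. Hand-written rules like these are the part the slides suggest could later be generated with machine learning.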