Geographical Web Search Engines and Geographical Information Retrieval (GIR)
Christopher Jones, Cardiff University
Edinburgh Euro GeoInf 2007

Where is Geo-information?
• Personal knowledge (in our heads)
  – of landscape, of where things, people and services are located, where things happened…
• Documents (various media)
  – Lists of where facilities, resources, structures are located
  – Textual descriptions of geographic phenomena
  – Images and videos of geographic space
• Maps

GIS and the Web
A GIS is typically:
– Isolated
– Supports individual organisation
– Accessed privately
– Small range of topics
– Structured data / geo-coded locations
– Finds answers
– Complicated to use
The World Wide Web is:
– Globally networked
– Supports everyone on Internet
– Accessed publicly
– Vast range of topics
– Unstructured free text / images
– Finds documents
– Easy to use

WWW as a source of geo-information
• Geographic context embedded in natural language descriptions
• Web queries depend on exact match of text terms
• No intelligent interpretation of spatial relationships (“near”, “west”, etc.)
• Place names ambiguous and confused with names of organisations, people, buildings and streets
• No geo-relevance ranking

Current motivation of GIR
Find geo-specific resources on the Web (mostly documents and images):
find web resources about Something related_to Somewhere
related_to = in, near, within Xkm, north_of, etc.
• Resolve ambiguity of names (many places have same name)
• Interpret the query spatial relationships
[Figure: query footprints labelled “near” and “north”]
• Find documents geographically associated with region of query footprint
• Relevance rank geographically by place and subject

GIR, GIS and The Web
[Diagram labels: Geoknowledge, World Knowledge, GIS, GIR, The Web]

Geographical Search Engines
• Google etc. have “local” versions
  – Based on business (yellow pages) directories
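The query form given under “Current motivation of GIR” (Something related_to Somewhere) can be sketched as a small structured type. This is only an illustration: the class name GeoQuery, the function parse_query and the fixed relation vocabulary are assumptions made here, not part of SPIRIT or any other system mentioned in these slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoQuery:
    """Hypothetical container for <Something> <related_to> <Somewhere>."""
    subject: str
    relation: str
    place: str
    distance_km: Optional[float] = None


# "within X km of" carries a distance, so it gets its own pattern.
_WITHIN = re.compile(
    r"^(?P<subject>.+?)\s+within\s+(?P<km>\d+(?:\.\d+)?)\s*kms?\s+of\s+(?P<place>.+)$",
    re.IGNORECASE,
)
# Assumed relation vocabulary, based on the examples on the slide.
_SIMPLE = re.compile(
    r"^(?P<subject>.+?)\s+(?P<rel>north of|south of|east of|west of|near|in)\s+(?P<place>.+)$",
    re.IGNORECASE,
)


def parse_query(text: str) -> Optional[GeoQuery]:
    """Split a free-text query into subject, spatial relation and place name."""
    text = text.strip()
    m = _WITHIN.match(text)
    if m:
        return GeoQuery(m["subject"], "within", m["place"], float(m["km"]))
    m = _SIMPLE.match(text)
    if m:
        relation = m["rel"].lower().replace(" ", "_")
        return GeoQuery(m["subject"], relation, m["place"])
    return None  # no recognised spatial relation
```

A structured interface with a dropdown of spatial relationships avoids this parsing step altogether, because the user supplies the three parts separately.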
Geographical Search Engines
SPIRIT research prototype: general geo-web search
Structured user interface: dropdown menu of spatial relationships

Geographical Search Engines
SPIRIT: results listed as URLs, plus symbols on map
User interface screenshots from Ross Purves et al., University of Zurich

Anatomy of a Geographical Search Engine
[Diagram labels: User Interface, Query disambiguation, Query footprint, Broker, Search Request + Query footprint, Search Engine, Unranked Results, Relevance Ranking, Ranked Results, Web Textual Resources, Geotagging, Document Footprints, Place Ontology, Spatial / Text Indexing, Indexes (Textual, Spatial)]

Geo-Tagging = Geo-parsing + Geo-coding
• Geo-parsing: recognising genuine geographic references (place names, addresses, post codes, phone codes), ignoring non-geographic uses
• Geo-coding: attaching a unique quantitative location (footprint) to geographic references

Geo-Parsing: true & false references
Some types of false geographic reference:
• Personal names: Smedes York
• Business names: Dorchester Hotel, York Properties…
• Street names: Oxford Street, London Road…
• Common words that are also places: urban, institute, land, battle, derby, over, well, …

Geo-Parsing: distinguishing between false and true geo-references
Look for patterns and context:
• Personal names (Jack London, Mr York): <First_name> <Location>; <Title> <Location>
• Business names (Paris Hotel): <Business_type> <Location> (or vice versa)
• Street names (Oxford Street): <Location> <Road_type>
Detect spatial prepositions (in, near, south of, outside, etc.): “he lived in Over”
Genuine occurrences can be used to train machine learning

Geo-coding (grounding) the genuine geo-references
• Many different places with the same name (referent ambiguity): Newport, Cambridge, Springfield…
• Use context to decide (references to parent or nearby places)
• Or choose the most important one (by population or place type hierarchy)

Anatomy of a Geographical Search Engine
[Diagram of search engine components, as on the earlier slide of the same title]

Indexing Web Resources
Standard text index is an inverted file
Query: Restaurants in Cardiff
Find documents that contain all terms

Text term     List of resources containing term
apple         Doc79, Doc89, Doc822, …
Cardiff       Doc2, Doc19, Doc37, …
door          Doc16, Doc49, Doc112, …
hotel         Doc1, Doc2, Doc23, …
in            Doc4, Doc7, Doc19, …
London        Doc20, Doc35, Doc150, …
pub           Doc9, Doc11, Doc100, …
restaurant    Doc19, Doc22, Doc37, …
…             …

Works literally for “in”, but won’t find contained places.
Doesn’t work in general for “near”, “Xkms from”, “north_of”, etc.

Why Spatial Indexing?
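One way to see the limitation is a toy inverted file queried with the “all terms must match” rule from the Indexing Web Resources slide. The sketch below is illustrative only: the three documents and the names build_inverted_index and search_all_terms are made up for this example. Roath and Canton are districts of Cardiff, so the second document is geographically relevant even though it never contains the word “Cardiff”.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up documents for illustration.
docs = {
    "DocA": "restaurants in cardiff",
    "DocB": "restaurants in roath and canton",  # districts of Cardiff
    "DocC": "hotel in london",
}
index = build_inverted_index(docs)
hits = search_all_terms(index, "restaurants in cardiff")
```

Here hits contains only DocA. DocB is missed because exact term matching has no knowledge that its places lie inside Cardiff.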
Query: “Hotels outside and within 30Kms of Glasgow”
Need to find documents referring to hotels that are in places other than Glasgow
Query: “Castles in Wales”
Need to find documents that refer to names of places in Wales (perhaps without mentioning “Wales”)
• In both cases, conventional text indexing would require the query to contain the names of all places in Wales, or of all places outside Glasgow within 30 km

Spatial indexing of resources
• Use dominant geographic references of documents to create document footprints (point, polygon, bounding rectangle, …)
• Use footprints to index documents
• Convert query to a query footprint
• Match query footprint to document footprints
[Figure labels: Spatial Query, Result]

Anatomy of a Geographical Search Engine
[Diagram of search engine components, as on the earlier slide of the same title]

Geographical Relevance Ranking
Example: airports near Leicester (the further away, the lower the spatial score)
• Determine “distance” between query footprint and document footprint
• Depends on query spatial operator (in, outside, X Kms from, north_of, etc.)
[Figure: query footprint Q and document footprint D, with spatial score. Figure from Marc van Kreveld, University of Utrecht]

Combining textual and spatial scores
• Textual scores: BM25
• Spatial scores: by spatial footprint analysis
[Figure: axes are normalized BM25 score and spatial score (0 to 1); labels: query / ideal footprint, footprints of documents. Figure from Marc van Kreveld, University of Utrecht]

Anatomy of a Geographical Search Engine
[Diagram of search engine components, as on the earlier slide of the same title]

Place Ontology
Encodes knowledge of terminology and structure of geographic space
• alternative names, languages
• place types (political, topographic, social, …)
• footprint (point, MBR, polygon)
• spatial relationships and attributes: containment, adjacency, overlap
• imprecise (vernacular) places (“Midlands”, “south of France”, “Scottish borders”, “Pennines”, “Highlands”, …)
Derive from gazetteers, thesauri, maps & the web

Roles of Place Ontology
[Diagram labels: User Interface, Metadata Extraction, Geo-Tagging, Query Disambiguation, Query Expansion (query footprint), Relevance Ranking, Spatial Index, Search Component, Web collection, document footprints, ontology]

Mining text on the web for vernacular place name knowledge
• Objective: estimate spatial extent of vague place
• Documents that refer to vague places may also refer to more precise places inside them
• Places that occur frequently in association with a target named place may have a higher chance of being inside it
• Analyse frequency of occurrence of co-located places

Places mentioned in documents retrieved by queries on the “Cotswolds”
[Figure from Ross Purves et al., University of Zurich]

GIR and GIS
• GIR currently dominated by web search
  – Unstructured results in multiple documents
• Sometimes single focused result wanted
  – Hotels within 1 kilometre of the British Museum in London
  – Where are pre-sixteenth century dwellings in USA?
  – Which areas of East Anglia would be flooded if sea level rose by 1 metre?

Bringing GIR and GIS together
[Diagram labels: Geo-knowledge, World Knowledge, GIS, GIR, The Web]

GeoInformation Services
Encode Geo-information in Web Services (Geo-services)
• Parse natural language queries
• Interpret geo-terminology of queries
• Identify the relevant geo-services to match geo and non-geo concepts
• Compose appropriate chain of services

EU - TRIPOD Project
• Improve accessibility of images on web
• Focus on geographical context
• Enhance captions / metadata for archival images
• Automatically generate captions for images from location / orientation-aware cameras
• Web harvesting to enrich metadata
• Interpret (vague) spatial natural language
• Toponym ontology of places and landmarks (including vernacular places)
• Use 3D landscape models to determine what is in camera view
• Prototype image search engine
http://tripod.shef.ac.uk/index.html

Future of GIR?
• Improve “conventional GIR” components:
  – Geo-tagging, spatio-textual indexing and geo-relevance ranking
• Place ontologies with world-wide coverage
• Understanding of spatial natural language
• Integrate time & space (temporal language)
• Open GeoInformation Web services
• Adapt GIR to personal needs & location

More Information
• See www.geo-spirit.org for information on the SPIRIT project and downloads of articles and project deliverables. [N.B. Prototype search engine (with link from SPIRIT web site) is no longer functional]
• TRIPOD: www.ProjectTripod.org
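Worked sketch: spatial matching and geographical relevance ranking
As a rough illustration of the “Spatial indexing of resources” and “Combining textual and spatial scores” slides, the sketch below matches bounding-rectangle footprints against a query footprint for a “near” query and then ranks the matches. Several things here are assumptions made for the example and are not taken from the slides: planar coordinates in kilometres, an exponential distance decay for the spatial score, and a simple weighted sum to combine it with an already normalized text score such as BM25. The names BBox, expand, spatial_score and rank are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Bounding-rectangle footprint in planar kilometres (assumed units)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "BBox") -> bool:
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)

    def centre(self):
        return ((self.min_x + self.max_x) / 2, (self.min_y + self.max_y) / 2)


def expand(box: BBox, km: float) -> BBox:
    """Query footprint for 'near': the place footprint grown by km on each side."""
    return BBox(box.min_x - km, box.min_y - km, box.max_x + km, box.max_y + km)


def spatial_score(query: BBox, doc: BBox, scale_km: float) -> float:
    """Score in (0, 1]: the further away, the lower the score (assumed decay)."""
    qx, qy = query.centre()
    dx, dy = doc.centre()
    return math.exp(-math.hypot(dx - qx, dy - qy) / scale_km)


def rank(place: BBox, docs, radius_km: float = 30.0, weight: float = 0.5):
    """docs: iterable of (doc_id, normalized_text_score, footprint).

    Keeps documents whose footprint intersects the query footprint and
    orders them by a weighted sum of text and spatial scores (assumed).
    """
    query_footprint = expand(place, radius_km)
    scored = []
    for doc_id, text_score, footprint in docs:
        if not footprint.intersects(query_footprint):
            continue  # spatial filter
        s = spatial_score(place, footprint, radius_km)
        scored.append((weight * text_score + (1 - weight) * s, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Made-up footprints: two documents with equal text scores, one far outside.
place = BBox(0, 0, 10, 10)
docs = [
    ("close", 0.6, BBox(12, 0, 14, 2)),
    ("far", 0.6, BBox(35, 0, 38, 3)),
    ("outside", 0.9, BBox(100, 100, 101, 101)),
]
ranking = rank(place, docs)
```

With equal text scores, the nearer document ranks first, and the document outside the query footprint is dropped despite its higher text score. A production system would use a real spatial index rather than a linear scan.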