SIMS 290-2: Search Engines (Fall 2005)
Professor Marti Hearst
Midterm Literature Review
Mike Wooldridge (mikew@sims.berkeley.edu)

Web Search and Geographic Location

Location can be very important when you're performing a Web search. If you type "homes for sale" into the query box at Google or Yahoo!, you're probably looking for information about homes for sale in a specific geographic region. More often than not, you'll want listings that are near where you are now. Search engines that can understand when geographic location is important to the user, and can interpret the location information in Web documents to find the ones that are most relevant, will be able to return better search results.

This literature review looks at a number of strategies for improving geographic search on the Web. As the literature shows, improvement can come from different places in the search process: from tweaking the way search engines index the content they find on the Web, to fine-tuning how they interpret the queries submitted by the user, to revamping the methods they use when matching queries to documents. Adding geographic intelligence to search engines will become ever more important in our increasingly mobile world, where more and more people will be accessing the Web from cell phones and other devices to find information specific to where they are at that moment.

Strategy #1: Learn From Libraries

Written in 1995, Larson's "Geographic Information Retrieval and Spatial Browsing" [1] looks at geographic search in the context of digital libraries. The challenges and techniques described are similar to those in later papers that examine Web search. This makes sense when you consider that most of the strategies underpinning today's Web search engines arose out of search technology developed for libraries.

Central to Larson's paper is the distinction between deterministic and probabilistic retrieval. With deterministic retrieval, a system returns results based on the exact match of query terms to the data contained in a collection. A user request is interpreted at face value, and results are returned that satisfy the request. Probabilistic retrieval carries more uncertainty, since it generally involves a subjective interpretation of what is relevant to a user. The interpretation might involve expansion of query terms using a thesaurus or categorization of documents in a collection based on subject headings.

In geographic information retrieval, different types of searches can have deterministic or probabilistic properties. For instance, a "point-in-polygon" query, which asks whether a given coordinate lies within a containing region, is relatively deterministic. However, a query that asks what cities are within a certain distance of a landmark might involve a more probabilistic interpretation. The system might need to decide whether to measure distance from a city's center or from its border, and what to do when there is overlap between the regions being compared.

The paper also discusses how a document's vocabulary can lead to imprecision with regard to geographic search. For instance, geographic terms may refer to more than one place in the world ("San Jose") or vary in spelling depending on the language and historical time period ("Peking" and "Beijing"). Faced with such ambiguity, designers of geographic information systems face a significant challenge when deciding how to convert geographic information in a document to precise geographic coordinates.
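To make the ambiguity problem concrete, here is a minimal sketch of a gazetteer lookup. The place names, coordinates, and the disambiguation heuristic (prefer the candidate whose region is mentioned in the surrounding text) are my own illustrative assumptions; Larson's paper does not prescribe a particular implementation.

```python
# A toy gazetteer illustrating the ambiguity problems Larson describes.
# All names, coordinates, and the tie-breaking heuristic are assumptions
# made for illustration, not details from the paper.

GAZETTEER = {
    # One name, many places in the world
    "san jose": [
        {"lat": 37.34, "lon": -121.89, "region": "california"},
        {"lat": 9.93,  "lon": -84.08,  "region": "costa rica"},
    ],
    # Two historical spellings, one place
    "peking":  [{"lat": 39.90, "lon": 116.40, "region": "china"}],
    "beijing": [{"lat": 39.90, "lon": 116.40, "region": "china"}],
}

def geocode(name, context=""):
    """Return candidate coordinates for a place name, using any region
    mentioned in the surrounding text to break ties."""
    candidates = GAZETTEER.get(name.lower(), [])
    if len(candidates) > 1 and context:
        narrowed = [c for c in candidates if c["region"] in context.lower()]
        if narrowed:
            return narrowed
    return candidates

print(geocode("San Jose"))                           # two candidates
print(geocode("San Jose", "flights to Costa Rica"))  # one candidate
```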
Finally, the article describes GIPSY, an indexing tool that constructs visual, grid-based representations of documents based on the cities, countries, and other geographic terms in them. GIPSY's representations are similar to the "geographic footprints" developed by Markowetz et al. and discussed later.

Strategy #2: Employ Semantic Technologies to Understand Geographic Context

In "Toward the Semantic Geospatial Web" (2002) [2], Egenhofer proposes the Semantic Web as a way to help people find location-related documents online. The Semantic Web aims to add a new layer of meaning to HTML documents by using technologies such as RDF and XML to describe the relationships among the concepts contained in the documents. For instance, for the phrase "lakes in Maine," semantic information could describe how "lake" is a body of water, "Maine" is a state in the northeastern U.S., and "in" describes the relative location of the two concepts. The Semantic Web could allow search engines to move beyond simple keyword-based analysis and "understand" documents in ways that are similar to how humans understand them.

The paper also explores how to improve geographic search by improving the way users request information from search engines. Egenhofer describes the notion of a "geospatial request," which is constructed using geographic terms (such as "lake" and "Maine") combined with comparators (such as "in") that describe the relationship of the terms. Users would use such requests to define their geographic information needs concisely. According to Egenhofer, the combination of a Semantic Web of documents and a geospatial method for querying them will form the Semantic Geospatial Web, a system that allows better retrieval of Web documents based on their geographic characteristics.

While it may be a while until we have a Semantic Web available to help us search for geographic content, I can imagine Web authoring tools incorporating the ability to add geographic cues (in the form of RDF) to Web content. If deployed on a large scale, such cues could make it much easier to find documents related to specific locations. I'm more skeptical of users learning to formulate their requests with highly structured geographic queries, considering the difficulties the average person has understanding and using techniques such as Boolean logic.

Strategy #3: Intelligently Combine Thematic and Geographic Search

In "Indexing and Ranking in Geo-IR Systems" (2005) by Martins et al. [3], the authors propose to meet the challenges of geographic search by building two separate types of search indexes: keyword-based (or thematic) indexes, like the ones that power most of today's popular search engines, and geographic indexes that relate geographic locations to particular Web documents.

To create a thematic index, a search engine performs a crawl of the Web, extracting content and building a lookup table that associates each keyword with the pages in which it appears. When the search engine receives a search request, it matches the query terms with keywords in the thematic index and returns a list of pages relevant to the query. (Additional ranking methods, including link-related algorithms such as PageRank, can also be added.) As the paper describes, you can build an index for geographic terms in much the same way. Instead of relating keywords to documents, such an index relates geographic terms to documents. During a geographic search, the search system analyzes the query for geographic information and then returns a ranked list of documents based on matches found in the geographic index.
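Here is a minimal sketch of this two-index design, assuming a toy gazetteer and an additive scoring scheme with a weight parameter of my own devising; the weighted combination of the two components is discussed further below.

```python
# A minimal sketch of the two-index design Martins et al. describe:
# a thematic (keyword) inverted index plus a geographic one. The
# gazetteer, the scoring scheme, and the alpha weight are assumptions.
from collections import defaultdict

PLACE_NAMES = {"berkeley", "oakland", "lisbon"}  # toy gazetteer

thematic_index = defaultdict(set)    # keyword -> set of doc ids
geographic_index = defaultdict(set)  # place   -> set of doc ids

def index_document(doc_id, text):
    for term in text.lower().split():
        if term in PLACE_NAMES:
            geographic_index[term].add(doc_id)
        else:
            thematic_index[term].add(doc_id)

def search(keywords, place, alpha=0.5):
    """Score = alpha * thematic matches + (1 - alpha) * geographic match."""
    scores = defaultdict(float)
    for kw in keywords:
        for doc in thematic_index.get(kw, ()):
            scores[doc] += alpha
    for doc in geographic_index.get(place, ()):
        scores[doc] += (1 - alpha)
    return sorted(scores, key=scores.get, reverse=True)

index_document(1, "homes for sale in Berkeley")
index_document(2, "home repair tips")
print(search(["homes", "sale"], "berkeley"))  # doc 1 ranks first
```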
If you have a search engine with both thematic and geographic components, you can combine them in a variety of ways. The system can use the geographic component as a filter on results retrieved from the thematic index, or vice versa. It can also give the two components different weights: geography could be emphasized for some searches ("homes for sale") and de-emphasized for others ("home repair tips"). I thought this was a smart strategy, since the relevance of geography definitely varies depending on a user's goal. (As we see in the Gravano paper, typically only about 15% of Web searches are local in nature and require geographic analysis.) Weighting could also be applied automatically based on a user's context. For instance, a search made from a mobile phone could trigger a stronger geographic weighting.

Strategy #4: Attach "Footprints" to Pages to Represent Their Geographic Focus

In "Design and Implementation of a Geographic Search Engine" (2005) by Markowetz et al. [4], the authors describe a search engine prototype and its performance on 31 million Web pages in the German .de domain. To provide geographic search, the system extracts location-related terms from crawled pages, matches the terms to geographic coordinates, and then uses those coordinates to build a geographic "footprint" for each page. Each footprint is a 1,024x1,024 grid laid over a map of Germany, with each element in the grid containing a relevance value. By comparing the footprints of Web pages with a footprint defined by the user when submitting a query, the search engine attempts to produce a geographically relevant set of search results.

A significant challenge that the authors encountered, one also described by Larson, involved accurately assigning words to geographic coordinates on the German map. This is because names of towns and other geographic features can often double as common German words (for instance, "Oder" is both the name of a river and the word "or"). To map the town names accurately, the system classified geographic names as strong (if they were unambiguous) or weak (if they were ambiguous). A weak name required a strong name in proximity to be counted as relevant; otherwise it was ignored. The system also classified some words as killer terms, which would cause strong names in their vicinity to be ignored. ("Mr." is a killer term since it usually comes before a person's name.)

Another challenge involved the standard practice of publishing address information for a Web site on a "Contact" page. If pages were labeled based only on their own content, Contact pages would be given much more geographic importance than was practical. The authors mitigated this with a method called "geo propagation," where a page inherits dampened versions of the footprints of pages that are one or two links away from it. Thus the final footprint assigned to a Web page by the system was based on its own content as well as the content of its neighbors.
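Here is a minimal sketch of geo propagation as I understand it from the paper. The tiny 2x2 grids, the damping factor, the additive blending, and single-hop propagation are all illustrative assumptions; the actual system uses 1,024x1,024 grids and looks up to two links away.

```python
# A sketch of "geo propagation": a page's footprint is blended with
# dampened copies of its link neighbors' footprints. Grid size,
# damping factor, and one-hop propagation are assumptions.

DAMPING = 0.3  # assumed weight given to a neighbor's footprint

def propagate(footprints, links):
    """footprints: page -> 2D grid of relevance values.
    links: page -> list of neighboring pages (one hop away)."""
    result = {}
    for page, grid in footprints.items():
        merged = [row[:] for row in grid]  # start from the page's own footprint
        for neighbor in links.get(page, []):
            for i, row in enumerate(footprints[neighbor]):
                for j, value in enumerate(row):
                    merged[i][j] += DAMPING * value
        result[page] = merged
    return result

# A page with no geographic content of its own inherits a dampened
# copy of the footprint of the "Contact" page it links to.
contact = [[0, 0], [0, 1.0]]
empty   = [[0, 0], [0, 0]]
print(propagate({"home": empty, "contact": contact},
                {"home": ["contact"]})["home"])  # [[0, 0], [0, 0.3]]
```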
In my opinion, building geographic footprints appears to be a nice visual way to represent the sometimes complex geographic data associated with a Web page. I can imagine an interactive search engine that allows users to construct a footprint to associate with their query using a touch screen or a pen tablet. A search engine such as the one in this paper could also remember the more popular or useful footprints and offer them to users in advance.

I also found some of the geographic idiosyncrasies common to the German test bed interesting. For instance, most German addresses don't include a state, which makes disambiguation of city names on a page more difficult. But German law also requires Web sites to make contact information reachable within two clicks of every page on the site, which makes geographic inference easier. Such idiosyncrasies suggest that differences between countries could present a challenge when trying to optimize a geographic search engine across borders.

Strategy #5: Infer Location Via a Page's URL

While most of the other studies tried to determine a page's geographic focus based on its content, "GeoSearcher: Location-Based Ranking of Search Engine Results" (2003) by Watters and Amoudi [5] tried to determine that focus based on the page's URL. It used three techniques: examining the top-level domain (for instance, ".uk" would suggest a page located in the United Kingdom); looking at Whois information, which is the registration data associated with the domain name; and using a tool called IPtoLL that maps IP addresses to latitude and longitude coordinates. The system matched Web pages to the correct geographic origin with a surprisingly high frequency: 95%. For pages in the U.S. and Canada, it pinpointed location down to the state or province level; for other localities, it pinpointed location down to the country level. To measure correctness, the researchers compared addresses and other geographic information in a page to the URL-based classification.

The URL-based classification strategy avoids the uncertainty and cost involved with disambiguating the geographic information extracted from the content of Web pages. But it does this at the expense of precision, since the URL-based strategy can only assign a geographic label down to the level of state, province, or country. I'm skeptical of the value of such a system in the real world, since users interested in geographically sensitive results will probably want them down to the city level or lower. (How would pages localized only to California help me with my "homes for sale" search in Berkeley?) The technique might be more useful as a first-pass strategy. If the system knew a page's country focus, it might be able to more easily disambiguate meanings at the more granular city or neighborhood level.

Strategy #6: Analyze the Search Query to Understand Geographic Intent

Rather than focusing on the geographic content of Web documents to improve geographic search, "Categorizing Web Queries According to Geographical Locality" (2003) by Gravano et al. [6] tried to understand the intent associated with a user's query. Specifically, it attempted to distinguish queries that were global (seeking information that is not specific to a geographic location) from those that were local (seeking information that is specific to a geographic location). Determining a query's geographic intent is a challenge since Web queries tend to be short. (According to the paper, in an analysis of more than 2 million queries, 85% were five words or fewer.)

Because of this limitation, the study attempted to determine a query's local or global intent in a somewhat backward fashion: it analyzed the result set produced by the query. If a result set included mostly non-geographic documents, the system would label the query as global. If a result set included mostly documents specific to locations, the system would label the query as local. If a query was local, the system would give precedence to location-specific pages within the result set, which would theoretically be more relevant to the user.

The study compared a variety of machine-learning systems for classifying queries as global or local. On the whole, the systems labeled most (83-89%) of the queries tested as global, with varying levels of accuracy. Optimal performance wasn't necessarily correlated with the more complex classification systems. For instance, the "Ripper" system produced some of the best results using only one simple rule: if the average number of city locations per returned Web page in the results exceeded a threshold, the query was labeled as local. Otherwise it was labeled as global.
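Here is a sketch of that single-rule classifier. The city list, the sample result pages, and the threshold value are illustrative assumptions (the paper learns the threshold from training data), and I count city-name mentions as a stand-in for the paper's city-location extraction.

```python
# A sketch of the single-rule classifier attributed to Ripper: a query
# is "local" if its result pages mention enough city names on average.
# The city list, pages, and threshold are illustrative assumptions.

CITIES = {"berkeley", "oakland", "seattle"}
THRESHOLD = 1.0  # assumed; learned from training data in the paper

def count_cities(page_text):
    return sum(1 for term in page_text.lower().split() if term in CITIES)

def classify_query(result_pages):
    """Label a query local or global from its result set alone."""
    avg = sum(count_cities(p) for p in result_pages) / len(result_pages)
    return "local" if avg > THRESHOLD else "global"

print(classify_query(["homes for sale in Berkeley and Oakland",
                      "Berkeley real estate listings"]))   # local
print(classify_query(["how to fix a leaky faucet",
                      "home repair tips and tools"]))      # global
```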
Once the global or local intent of a query was established (based on the initial results), the system would process those results and return them to the user. If the query was global, the system returned the results as-is. If the query was local and there was already a geographic term in the query, the system also returned the results as-is (since the results were already localized). If the query was local and didn't include a geographic term, the system would take steps to transform the results to fit the user's geographic location.

I thought the most interesting aspect of the system was that it attempts to determine intent on its own, without burdening the user with additional questions or UI widgets. As Google discovered with its clean, uncluttered interface, less can be better. This study was also interesting in that it produced hard numbers about how often people actually care about geography when performing a search (only about 15% of the time, judging from the share of queries classified as local). This means that while geography is an important aspect to consider in search, it shouldn't be one that takes priority in all or even most searches.

References

[1] R. Larson. Geographic Information Retrieval and Spatial Browsing. GIS and Libraries: 32nd Annual Clinic on Library Applications of Data Processing Conference, University of Illinois at Urbana-Champaign. 1995.
[2] M. Egenhofer. Toward the Semantic Geospatial Web. Tenth ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia. 2002.
[3] B. Martins, M. J. Silva, and L. Andrade. Indexing and Ranking in Geo-IR Systems. Workshop on Geographic Information Retrieval, CIKM '05, Germany. 2005.
[4] A. Markowetz, Y.-Y. Chen, T. Suel, X. Long, and B. Seeger. Design and Implementation of a Geographic Search Engine. Technical Report TR-CIS-2005-03. 2005.
[5] C. Watters and G. Amoudi. GeoSearcher: Location-Based Ranking of Search Engine Results. Journal of the American Society for Information Science and Technology, 54(2):140–151. 2003.
[6] L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein. Categorizing Web Queries According to Geographical Locality. Proc. of the 12th CIKM. 2003.