Review of Geographic Search Engines

SIMS 290-2: Search Engines (Fall 2005)
Professor Marti Hearst
Midterm Literature Review
Mike Wooldridge (mikew@sims.berkeley.edu)
Web Search and Geographic Location
Location can be very important when you’re performing a Web search. If you type in
“homes for sale” in the query box at Google or Yahoo!, you’re probably looking for
information about homes for sale in a specific geographic region. More often than not,
you’ll want listings that are near where you are now. Search engines that can understand
when geographic location is important to the user, and can interpret the location
information in Web documents to find the ones that are most relevant, will be able to
return better search results.
This literature review looks at a number of strategies for improving geographic search on
the Web. As the literature shows, improvement can come from different places in the
search process: from tweaking the way search engines index the content they find on the
Web, to fine-tuning how they interpret the queries submitted by the user, to revamping
the methods they use when matching queries to documents.
Adding geographic intelligence to search engines will matter more and more in an
increasingly mobile world, where growing numbers of people access the Web from cell
phones and other devices to find information specific to where they are at that
moment.
Strategy #1: Learn From Libraries
Written in 1995, Larson’s “Geographic Information Retrieval and Spatial Browsing”1
looks at geographic search in the context of digital libraries. The challenges and
techniques described are similar to those in later papers that examine Web search. This
makes sense when you consider that most of the strategies underpinning today’s Web
search engines arose out of search technology developed for libraries.
Central to Larson’s paper is the distinction between deterministic and probabilistic
retrieval. With deterministic retrieval, a system returns results based on the exact match
of query terms to the data contained in a collection. A user request is interpreted at face
value, and results are returned that satisfy the request. Probabilistic retrieval carries more
uncertainty, since it generally involves a subjective interpretation of what is relevant to a
user. The interpretation might involve expansion of query terms using a thesaurus or
categorization of documents in a collection based on subject headings.
In geographic information retrieval, different types of searches can have deterministic or
probabilistic properties. For instance, a “point-in-polygon” query, which asks if a given
coordinate is within a containing region, is relatively deterministic. However, a query that
asks what cities are within a certain distance of a landmark might involve a more
probabilistic interpretation. The system might need to decide whether to measure distance
from a city’s center or from its border, and what to do when there is overlap between the
regions being compared.
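To make the deterministic case concrete, here is a minimal Python sketch of a point-in-polygon test using the standard ray-casting approach. It is an illustration only, not code from Larson's paper: the function name and the sample square are assumptions, and it treats coordinates as flat x/y pairs rather than dealing with map projections. Points lying exactly on a border are not handled specially.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: cast a horizontal ray to the right of the point and
    count how many polygon edges it crosses. An odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical region: a 4 x 4 square
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon((2.0, 2.0), square))
print(point_in_polygon((5.0, 2.0), square))
```

The answer depends only on the coordinates supplied, which is why this kind of query sits at the deterministic end of the spectrum.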
The paper also discusses how a document’s vocabulary can lead to imprecision with
regard to geographic search. For instance, geographic terms may refer to more than one
place in the world (“San Jose”) or vary in spelling depending on the language and
historical time period (“Peking” and “Beijing”). Faced with such ambiguity, designers of
geographic information systems face a significant challenge when deciding how to
convert geographic information in a document to precise geographic coordinates.
Finally, the article describes GIPSY, an indexing tool that constructs visual, grid-based
representations of documents based on the cities, countries, and other geographic terms in
them. GIPSY’s representations are similar to the “geographic footprints” developed by
Markowetz et al. and discussed later.
Strategy #2: Employ Semantic Technologies to Understand Geographic Context
In “Toward the Semantic Geospatial Web” (2002),2 Egenhofer proposes the Semantic
Web as a way to help people find location-related documents online. The Semantic Web
aims to add a new layer of meaning to HTML documents by using technologies such as
RDF and XML to describe the relationships among the concepts contained in the
documents. For instance, for the phrase “lakes in Maine,” semantic information could
describe how “lake” is a body of water, “Maine” is a state in the northeastern U.S., and
“in” describes the relative location of the two concepts. The Semantic Web could allow
search engines to move beyond simple keyword-based analysis and “understand”
documents in ways that are similar to how humans understand them.
The paper also explores how to improve geographic search by refining the way users
request information from search engines. Egenhofer describes the notion of a “geospatial
request,” which is constructed using geographic terms (such as “lake” and “Maine”)
combined with comparators (such as “in”) that describe the relationship of the terms.
Users would use such requests to define their geographic information needs concisely.
According to Egenhofer, the combination of a Semantic Web of documents and a
geospatial method for querying them will form the Semantic Geospatial Web, a system
that allows better retrieval of Web documents based on their geographic characteristics.
Although it may be some time before a Semantic Web is available to help us search for
geographic content, I can imagine Web authoring tools incorporating the ability to add
geographic cues (in the form of RDF) to Web content. If deployed on a large scale, such
cues could make it much easier to find documents related to specific locations. I’m more
skeptical of users learning to formulate their requests with highly structured geographic
queries, considering the difficulties the average person has understanding and using
techniques such as Boolean logic.
Strategy #3: Intelligently Combine Thematic and Geographic Search
In “Indexing and Ranking in Geo-IR Systems” (2005) by Martins et al.,3 the authors
propose to meet the challenges of geographic search by building two separate types of
search indexes: keyword-based—or thematic—indexes, like the ones that power most of
today’s popular search engines, and geographic indexes that relate geographic locations
to particular Web documents.
To create a thematic index, a search engine performs a crawl of the Web, extracting
content and building a lookup table that associates each keyword with the pages in which
it appears. When the search engine receives a search request, it matches the query terms
with keywords in the thematic index and returns a list of pages relevant to the query.
(Additional ranking methods, including link-related algorithms such as PageRank, can
also be added.)
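The lookup table described above is usually called an inverted index. The following Python sketch shows the idea on a toy scale. The page URLs and text are made up, and the matching rule (a page must contain every query term) is an assumption for illustration rather than the ranking method used by Martins et al.

```python
from collections import defaultdict


def build_thematic_index(pages):
    """Map each keyword to the set of pages in which it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled pages
pages = {
    "a.example/homes": "homes for sale in berkeley",
    "b.example/tips": "home repair tips",
    "c.example/sale": "garage sale berkeley",
}
index = build_thematic_index(pages)
print(sorted(search(index, "berkeley sale")))
```

A geographic index could reuse the same structure with place names (or coordinates) as the keys instead of arbitrary keywords.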
As the paper describes, you can also build an index for geographic terms similar to how
you build an index for the keywords in thematic search. Instead of relating keywords to
documents, such an index relates geographic terms to documents. During a geographic
search, the search system analyzes the query for geographic information and then returns
a ranked list of documents based on matches found in the geographic index.
If you have a search engine with both thematic and geographic components, you can
combine them in a variety of ways. The system can use the geographic component as a
filter on results retrieved from the thematic index, or vice versa. It can also give the two
components different weights: geography could be emphasized for some searches
(“homes for sale”) and de-emphasized for others (“home repair tips”).
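As a rough sketch of the weighting idea, the Python below blends a thematic score and a geographic score with a single adjustable weight. The linear formula, the score values, and the document names are all assumptions made for illustration; the paper as summarized here does not prescribe this particular formula.

```python
def combined_score(thematic, geographic, geo_weight):
    """Blend two scores in [0, 1]; geo_weight controls how much geography counts."""
    if not 0.0 <= geo_weight <= 1.0:
        raise ValueError("geo_weight must be between 0 and 1")
    return (1.0 - geo_weight) * thematic + geo_weight * geographic


def rank(docs, geo_weight):
    """Sort documents from highest to lowest combined score."""
    return sorted(
        docs,
        key=lambda d: combined_score(d["thematic"], d["geographic"], geo_weight),
        reverse=True,
    )


# Hypothetical results: one strong keyword match far away, one weaker match nearby
docs = [
    {"name": "listing_far", "thematic": 0.9, "geographic": 0.1},
    {"name": "listing_near", "thematic": 0.6, "geographic": 0.9},
]

# A geography-heavy search ("homes for sale") versus a geography-light one
print([d["name"] for d in rank(docs, geo_weight=0.7)])
print([d["name"] for d in rank(docs, geo_weight=0.1)])
```

Setting the weight to 0 or 1 reduces this to pure thematic or pure geographic ranking, while the filtering approach would instead discard documents whose score on one component falls below a cutoff.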
I thought this was a smart strategy, since the relevance of geography definitely varies
depending on a user’s goal. (As we see in the Gravano paper, typically only 15% of Web
searches are local in nature and require geographic analysis.) Weighting could also be
applied automatically based on a user’s context. For instance, a search made from a
mobile phone could bring about a stronger geographic weighting.
Strategy #4: Attach “Footprints” to Pages to Represent Their Geographic Focus
In “Design and Implementation of a Geographic Search Engine” (2005) by Markowetz et
al.,4 the authors describe a search engine prototype and its performance with respect to 31
million Web pages in the German .de domain. To provide geographic search, the system
extracts location-related terms from crawled pages, matches the terms to geographic
coordinates, and then uses those coordinates to build a geographic “footprint” for each
page. Each footprint is a 1,024x1,024 grid laid over a map of Germany, with each
element in the grid containing a relevance value. By comparing the footprints of Web
pages with a footprint defined by the user when submitting a query, the search engine
attempts to produce a geographically relevant set of search results.
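One simple way to picture the footprint comparison is as an overlap score between two grids. The Python sketch below stores only the non-zero cells of each grid in a dictionary and sums the products of values in cells the two footprints share. Both the sparse representation and the scoring rule are assumptions for illustration; the summary above does not say how Markowetz et al. actually compute the match.

```python
def footprint_score(page_fp, query_fp):
    """Sum of products of relevance values over cells present in both footprints.

    Each footprint maps a (row, col) grid cell to a relevance value;
    cells that are absent are treated as zero.
    """
    # Iterate over the smaller footprint and look cells up in the larger one
    small, large = sorted((page_fp, query_fp), key=len)
    return sum(value * large.get(cell, 0.0) for cell, value in small.items())


# Hypothetical footprints on a 1,024 x 1,024 grid
query_fp = {(10, 11): 1.0, (10, 12): 1.0}
nearby_page = {(10, 10): 0.5, (10, 11): 1.0}
distant_page = {(500, 500): 1.0}

print(footprint_score(nearby_page, query_fp))
print(footprint_score(distant_page, query_fp))
```

Pages whose footprints do not intersect the query footprint score zero and can be dropped before any thematic ranking is applied.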
A significant challenge that the authors encountered—a challenge also described by
Larson—involved accurately assigning words to geographic coordinates on the German
map. This is because names of towns and other geographic features can often double as
common German words (for instance, “Oder” is both the name of a river and the word
“or”). To map the town names accurately, the system classified geographic names as
strong (if they were unambiguous) or weak (if they were ambiguous). A weak name
required a strong name in proximity to be counted as relevant; otherwise it was ignored.
The system also classified some words as killer terms, which would cause strong names
in their vicinity to be ignored. (“Mr.” is a killer term since it usually comes before a
person’s name.)
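The strong, weak, and killer rules can be sketched in a few lines of Python. The word lists, the three-token window, and the choice to let only surviving strong names support a weak name are all assumptions for illustration; the paper's real vocabularies and proximity rules are not given in this review.

```python
# Hypothetical vocabularies; a real system would use a full gazetteer
STRONG = {"berlin", "hamburg"}
WEAK = {"oder"}
KILLER = {"mr."}


def extract_place_names(tokens, window=3):
    """Return the place names accepted under the strong/weak/killer rules."""
    tokens = [t.lower() for t in tokens]

    def killer_nearby(i):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        return any(tokens[j] in KILLER for j in range(lo, hi) if j != i)

    # Strong names count unless a killer term appears within the window
    strong_hits = {
        i for i, t in enumerate(tokens) if t in STRONG and not killer_nearby(i)
    }

    accepted = []
    for i, t in enumerate(tokens):
        if i in strong_hits:
            accepted.append(t)
        elif t in WEAK and any(abs(i - j) <= window for j in strong_hits):
            # Weak names count only when an accepted strong name is close by
            accepted.append(t)
    return accepted


print(extract_place_names("the ferry crosses the oder near berlin".split()))
print(extract_place_names("tee oder kaffee".split()))
print(extract_place_names("mr. berlin called".split()))
```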
Another challenge involved the standard practice of publishing address information for a
Web site on a “Contact” page. If pages were labeled based only on their own content,
Contact pages would be given much more geographical importance than was warranted.
The authors mitigated this by using a method called “geo propagation,” where a page
inherits dampened versions of the footprints of pages that are one or two links away from
it. Thus the final footprint assigned to a Web page by the system was based on its own
content as well as the content in its neighbors.
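A minimal Python sketch of the propagation idea follows. It assumes footprints are stored as sparse dictionaries of grid cells, that propagation follows outgoing links only, and that the damping factor is 0.5 per link; none of these details come from the paper, and the page names are hypothetical.

```python
def propagate(footprints, links, damping=0.5):
    """Give each page dampened copies of the footprints of pages one or
    two links away. Values are read from the original footprints, so the
    result does not depend on the order in which pages are processed."""
    pages = set(footprints) | set(links)
    result = {}
    for page in pages:
        combined = dict(footprints.get(page, {}))
        one_hop = set(links.get(page, [])) - {page}
        two_hop = {q for p in one_hop for q in links.get(p, [])} - one_hop - {page}
        for neighbors, factor in ((one_hop, damping), (two_hop, damping ** 2)):
            for neighbor in neighbors:
                for cell, value in footprints.get(neighbor, {}).items():
                    combined[cell] = combined.get(cell, 0.0) + factor * value
        result[page] = combined
    return result


# Hypothetical site: only the Contact page carries an address
footprints = {"contact": {(3, 4): 1.0}}
links = {"home": ["about"], "about": ["contact"]}

final = propagate(footprints, links)
print(final["about"])
print(final["home"])
```

In this toy site the Contact page keeps its full footprint, while the About and Home pages pick up progressively weaker versions of it.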
In my opinion, building geographic footprints appears to be a nice visual way to represent
the sometimes complex geographic data associated with a Web page. I can imagine an
interactive search engine that allows users to construct a footprint to associate with their
query using a touch screen or a pen tablet. A search engine such as the one in this paper
could also remember the more popular or useful footprints, and offer them to users in
advance.
I also found some of the geographic idiosyncrasies common to the German test bed
interesting. For instance, most German addresses don’t include a state, which makes
disambiguation of city names on a page more difficult. But German laws also require
Web sites to include contact information no more than two clicks away from every site
page, which makes geographic inferences easier. Such idiosyncrasies suggest that
differences between countries could present a challenge when trying to optimize a
geographic search engine across borders.
Strategy #5: Infer Location Via a Page’s URL
While most of the other studies tried to determine a page’s geographic focus based on its
content, “Location-Based Ranking of Search Engine Results” (2003) by Watters et al.5
tried to determine that focus based on the page’s URL. It used three techniques:
examining the top-level domain (for instance, “.uk” would suggest a page located in the
United Kingdom); looking at Whois information, which is the registration data associated
with the domain name; and using a tool called IPtoLL that maps IP addresses to latitude
and longitude coordinates.
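The first of these techniques is easy to sketch in Python with the standard library's URL parser. The lookup table below is a tiny hypothetical sample, and the function name is an assumption. The Whois and IPtoLL steps are left out because they depend on external services.

```python
from urllib.parse import urlparse

# Small hypothetical sample of country-code top-level domains
COUNTRY_BY_TLD = {
    "uk": "United Kingdom",
    "de": "Germany",
    "ca": "Canada",
}


def country_from_url(url):
    """Guess a page's country from its top-level domain, or return None."""
    host = urlparse(url).hostname
    if not host:
        return None
    # The top-level domain is the last dot-separated label of the host name
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return COUNTRY_BY_TLD.get(tld)


print(country_from_url("https://www.example.co.uk/page"))
print(country_from_url("https://example.com/"))
```

Generic domains such as ".com" carry no country signal, which is presumably where the other two techniques would have to take over.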
The system matched Web pages to the correct geographic origin with a surprisingly high
frequency: 95%. For pages in the U.S. and Canada, it pinpointed location down to the state
or province level; for other localities, it pinpointed location down to the country level. To
measure correctness, the researchers compared addresses and other geographic
information in a page to the URL-based classification.
The URL-based classification strategy avoids the uncertainty and cost involved with
disambiguating the geographic information extracted from the content of Web pages. But
it does this at the expense of precision, since the URL-based strategy could only assign a
geographic label down to the level of state, province, or country. I’m skeptical of the
value of such a system in the real world, since users interested in geographically sensitive
results will probably want them down to the city level or lower. (How would pages
localized only to California help me with my “homes for sale” search in Berkeley?)
The technique might be more useful as a first-pass strategy. If the system knew a page’s
country focus, it might be able to more easily disambiguate meanings related to the more
granular city or neighborhood level.
Strategy #6: Analyze the Search Query to Understand Geographic Intent
Rather than focusing on the geographic content of Web documents to improve
geographic search, “Categorizing Web Queries According to Geographical Locality”
(2003) by Gravano et al.6 tried to understand the intent associated with a user’s query.
Specifically, it attempted to distinguish queries that were global (seeking information that
is not specific to a geographic location) from those that were local (seeking information
that is specific to geographic location).
Determining a query’s geographic intent is a challenge since Web queries tend to be
short. (According to the paper, in an analysis of more than 2 million queries, 85% were
five words or fewer.) Because of this limitation, the study attempted to determine a
query’s local or global intent in a somewhat backward fashion—it analyzed the results set
produced by the query. If a results set included mostly non-geographic documents, the
system would label the query as global. If a results set included mostly documents
specific to locations, the system would label the query as local. If a query was local, the
system would give precedence to location-specific pages within the results set, which
would theoretically be more relevant to the user.
The study compared a variety of machine-learning systems for classifying queries as
global or local. On the whole, the systems labeled most (83-89%) of the queries
tested as global, with varying levels of accuracy. Optimal performance wasn’t necessarily
correlated with the more complex classification systems. For instance, the “Ripper”
system produced some of the best results using only one simple rule: if the average
number of city locations per returned Web page in the results exceeded a threshold, the
query was labeled as local. Otherwise it was labeled global.
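The single rule is simple enough to write out directly. In the Python sketch below, the threshold of 1.0 and the per-page city counts are made-up values, since the review does not report the threshold the classifier actually learned.

```python
def classify_query(city_counts_per_page, threshold=1.0):
    """Label a query 'local' if its result pages mention, on average,
    more city locations than the threshold; otherwise label it 'global'."""
    if not city_counts_per_page:
        return "global"
    average = sum(city_counts_per_page) / len(city_counts_per_page)
    return "local" if average > threshold else "global"


# Hypothetical counts of city names found in each returned page
print(classify_query([3, 2, 4, 0]))
print(classify_query([0, 0, 1, 0]))
```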
Once the global or local intent of a query was established—based on the initial results—
the system would process those results and return them to the user. If the query was
global, the system returned the results as-is. If the query was local and there was already
a geographic term in the query, the system also returned the results as-is (since the results
are already localized). If the query was local and didn’t include a geographic term, the
system would take steps to transform the results to fit the user’s geographic location.
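The three cases can be summarized in a short Python sketch. The review does not say how the results were transformed for the last case, so the reordering used here (pages that mention the user's place move to the front) is purely an assumption, as are the field names and sample data.

```python
def process_results(results, query_is_local, query_has_place, user_place):
    """Return results as-is for global queries and for local queries that
    already name a place; otherwise move pages mentioning the user's place
    to the front while keeping the original order within each group."""
    if not query_is_local or query_has_place:
        return list(results)
    # sorted() is stable, and False sorts before True
    return sorted(results, key=lambda r: user_place not in r["places"])


# Hypothetical result set
results = [
    {"url": "a", "places": {"Boston"}},
    {"url": "b", "places": {"Berkeley"}},
    {"url": "c", "places": set()},
]

localized = process_results(results, True, False, "Berkeley")
print([r["url"] for r in localized])
```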
I thought the most interesting aspect of the system was that it attempts to determine intent
on its own, without burdening the user with additional questions or UI widgets. As
Google discovered with its clean, uncluttered interface, less can be better. This study was
also interesting in that it produced hard numbers about how often people actually care
about geography when performing a search (they care only about 15% of the time). This
means that while geography is an important aspect to consider in search, it shouldn’t be
one that takes priority in all or even most searches.
1. R. Larson. Geographic Information Retrieval and Spatial Browsing. GIS and
Libraries: 32nd Annual Clinic on Library Applications of Data Processing Conference,
University of Illinois at Urbana-Champaign. 1995.
2. M. Egenhofer. Toward the Semantic Geospatial Web. Paper presented at the Tenth
ACM International Symposium on Advances in Geographic Information Systems,
McLean, Virginia. 2002.
3. B. Martins, M. J. Silva, and L. Andrade. Indexing and Ranking in Geo-IR Systems.
Workshop on Geographic Information Retrieval, CIKM ’05, Germany. 2005.
4. A. Markowetz, Y.-Y. Chen, T. Suel, X. Long, and B. Seeger. Design and
Implementation of a Geographic Search Engine. Technical Report TR-CIS-2005-03.
2005.
5. C. Watters and G. Amoudi. GeoSearcher: Location-Based Ranking of Search Engine
Results. Journal of the American Society for Information Science and Technology,
54(2):140–151. 2003.
6. L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein. Categorizing Web Queries
According to Geographical Locality. Proc. of the 12th CIKM. 2003.