Library Hi Tech News

To cite this document: James Powell, Ketan Mane, Linn Marks Collins, Mark L.B. Martinez, Tamara McMahon, (2010), "The Geographic Awareness Tool: techniques for geo-encoding digital library content", Library Hi Tech News, Vol. 27 Iss: 9, pp. 5-9
Permanent link to this document: http://dx.doi.org/10.1108/07419051011110586
The Geographic Awareness Tool: techniques for
geo-encoding digital library content
James Powell, Ketan Mane, Linn Marks Collins, Mark L.B. Martinez and
Tamara McMahon
Introduction
Geo-encoding and geo-querying of
bibliographic metadata are still in their
infancy. There are many factors to
consider when designing a geo-reference
interface for library data. One is
appropriateness – is geo-encoding likely
to yield any additional information for a
user? Another is discovery – how do you
determine which geographic locations are
associated with the intellectual content of
a particular digital object? Another is
whether and when to use augmentation –
are there services that can help you
increase the amount of geo-encoded data
you can provide, and is the granularity
with which they can describe a
geographic region sufficient for your
purposes? Still another is the interface –
what tools will you offer for exploring
geo-encoded metadata, and what data
formats do they support? And, most
importantly, what is the motivation for
adding geo-reference capabilities?
Motivation
Geo-location capabilities are becoming
ubiquitous and increasingly sophisticated.
A steady stream of consumer electronics
is increasing geo-location literacy skills
among the public.
Travelers, hikers, and hobbyists use
augmented digital maps which overlay
geo-encoded information. These digital
map tools are supplied by navigation
systems, cell phones, and web sites.
Geo-encoding is a technique that is
increasingly used to enable augmented
reality applications and geographic
searching of data sets that include
geo-reference data. For example, popular
smart phones based on the Android and
Apple iOS operating systems offer
geo-enabled applications. In their simplest
form, the handset hardware combines a
global positioning system sensor with
an electronic compass, which enables
applications to know where the user is,
and which direction they are facing. If
you have ever tried to use a paper map
without a compass, then you have had
the experience of spinning the map
around trying to line it up with terrain or
a curve in the road ahead. With a device
that knows both where you are and what
direction you are pointed, applications
can generate map views that correspond
to your location and heading. And with
data from services such as Google Street
View, a location-enabled application
can show you a view of what you should
see at human eye level, which can
change as your orientation changes.
Overlays of geo-referenced data, that is,
any metadata which includes a
geographic component, can show who
or what is nearby. In a social context,
such as Facebook Places, users share
their location with each other, but in the
larger geographic context, various users
and entities supply geo-encoded data
that are then overlaid onto a map based
on a user’s request. It is this last
capability which libraries can leverage,
by ferreting out geospatial data in
metadata or full-text content, and
generating compatible output for these
new systems.
Geo-referenced metadata
A growing number of digital library
projects are experimenting with geo-referenced data to take advantage of the
ubiquity and popularity of geographic
services. Geo-referenced metadata
provides users with another way of
exploring library collections, and is
even more useful when an increasing
amount of the content is also available
digitally. There are a number of
challenges with offering this service:
surveys show that at best, only about 35
percent of metadata records contain geo-reference data that indicate the place(s)
associated with an item (Petras).
Although the MARC bibliographic
format includes fields for specifying this
information, geo-reference data, if
present at all, often ends up in subject
fields, or is available but opaque, buried
in the abstract.
We have been experimenting with
several approaches for both on-the-fly
geo-reference processing and batch
geo-reference augmentation of harvested
metadata content. Our various
implementations of what we call the
"Geographic Awareness Tool" (GAT)
share the following basic characteristics.
They support end-user searching: users
search the textual portion of the
geo-encoded content, such as titles or
descriptions, just as they would search a
library catalog. One notable version of
the GAT implemented a JITIR (just-in-time
information retrieval) technique (Rhodes
and Maes, 2000). With a JITIR search
tool, users do not need to explicitly
perform a search; instead, queries are
derived from their context and formulated
and executed on the fly. For example, if a
user is composing a blog post, the text
they enter is used to generate queries.
The GAT incarnations
incorporate externally geo-encoded data.
These data (whether RSS newsfeeds or
bibliographic data) have been previously
harvested and augmented with appropriate
geo-coordinates that relate to the textual
content, through various techniques.
Finally, the GAT overlays this
geo-encoded data onto a map, using
Web 2.0 technologies, including AJAX
(see Figure 1).
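To make the JITIR behavior concrete, the sketch below shows one way a context-derived query might be generated from text a user is composing; the function names and the simple term-frequency heuristic are illustrative assumptions, not the actual GAT implementation.

```python
# Minimal sketch of a JITIR-style query hook (illustrative only; the names
# extract_query_terms and search_catalog are hypothetical, not part of GAT).
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "is", "are"}

def extract_query_terms(draft_text, max_terms=5):
    """Derive a query from text the user is composing, without an explicit search."""
    words = re.findall(r"[A-Za-z][A-Za-z-]+", draft_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(max_terms)]

def on_user_typing(draft_text, search_catalog):
    """Called as the user types; search_catalog is whatever search API is available."""
    terms = extract_query_terms(draft_text)
    if terms:
        return search_catalog(" ".join(terms))
    return []
```

In the GAT, queries derived this way would be run against the geo-encoded collection and the results overlaid on the map just as with an explicit search.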
Methodology and approach
This section provides detailed
descriptions of the different approaches
that we used to parse and collect data,
and discover location information.

Figure 1. Overview of GAT setup to overlay marker on the map
Using an RSS feed-based approach
To test the feasibility of the project,
we decided to initially design our
prototype around an RSS feed for avian flu.
The World Health Organization (WHO)
is tasked by the United Nations to
monitor disease outbreaks, especially
infectious diseases with the potential to
spread across geo-political boundaries.
As part of this mission, the WHO
provides RSS newsfeeds to the public,
which contain information about
potential threats.
In 2007, we began to pursue two
geographically augmented information
retrieval projects to evaluate the potential
of Google maps to provide access to RSS
feeds, and search results from a metadata
search service. A sampling of RSS feeds
regarding reports of avian flu from the
WHO web site was gathered and parsed
for location information. At that time,
many experts considered the avian flu
virus, H5N1, a top candidate for leaping
across the species barrier and initiating a
major global flu pandemic. Each WHO
newsfeed was mapped to geographic
coordinates, using the Google maps
API. This process generated an XML
marker file, which is a simple format
for specifying location and associated
display information. The inbuilt code in
Google maps interprets the XML file
data to generate data marker overlays
onto its views (maps, satellite, or earth).
The marker file was hosted on a
web server and could be periodically
regenerated as new data arrived in the
WHO feed (Figure 2).

Figure 2. RSS feed-based approach for GAT
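A minimal sketch of that regeneration step might look like the following; the feed URL, the marker element names, and the geocode() helper are placeholders, not the exact format GAT produced or one required by the Google maps API.

```python
# Illustrative sketch: regenerate an XML marker file from an RSS feed.
# The feed URL, marker element names, and geocode() helper are hypothetical;
# the real marker format is whatever the map page's JavaScript expects.
# Requires the third-party 'feedparser' library.
import xml.etree.ElementTree as ET
import feedparser

def geocode(place_text):
    """Placeholder: look up (lat, lng) for a place name via some geocoding service."""
    return None  # wire up a real lookup (e.g. GeoNames) here

def build_marker_file(feed_url, out_path):
    feed = feedparser.parse(feed_url)
    markers = ET.Element("markers")
    for entry in feed.entries:
        coords = geocode(entry.title)      # e.g. a place name in the headline
        if coords is None:
            continue
        lat, lng = coords
        ET.SubElement(markers, "marker", {
            "lat": str(lat), "lng": str(lng),
            "label": entry.title, "link": entry.link,
        })
    ET.ElementTree(markers).write(out_path, encoding="utf-8", xml_declaration=True)

# build_marker_file("http://example.org/who-avian-flu.rss", "markers.xml")
```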
Using a digital library-based approach
At the same time, the Los Alamos
National Lab (LANL) Research Library
was undertaking a major initiative to
create a new service for exposing its
collection of 93 million bibliographic
metadata records which describe peer-reviewed scientific papers from around
the world. This content is locally loaded
into a digital repository called aDORe
(Sompel et al., 2005).
The LANL aDORe repository is a
high performance, scalable architecture
for managing distributed repositories of
metadata. Records from various vendors
are normalized and mapped into a
standard MARC XML format, and these
records are then placed in an MPEG-21
DIDL wrapper. Collections of DIDL
records are then stored in XML tape files;
each XML tape functions as a distinct
repository, and this content may be
harvested using the Open Archives
Initiative Protocol for Metadata
Harvesting (OAI-PMH). Individual records can be
retrieved from tapes via the aDORe
disseminator component. The architecture
is massively scalable, yet offers
performance that enables retrieval of
individual records from the repository
in real time.
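For illustration, an OAI-PMH harvest uses the standard ListRecords verb and a metadataPrefix parameter; the base URL below is a placeholder rather than the actual aDORe endpoint, and resumption-token paging is omitted.

```python
# Illustrative OAI-PMH harvest request; the base URL is a placeholder, not the
# actual aDORe endpoint. ListRecords and metadataPrefix are standard OAI-PMH
# parameters; resumptionToken paging is elided for brevity.
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_records(base_url, metadata_prefix="marcxml"):
    params = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urlopen(f"{base_url}?{params}") as resp:
        tree = ET.parse(resp)
    # Each <record> wraps a header plus the requested metadata payload.
    return tree.getroot().iter(f"{OAI_NS}record")

# for rec in harvest_records("http://example.org/oai"):
#     ...
```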
Our large collection, high-performance
repository, and the institution's strong
physical sciences leanings provided the
opportunity and incentive to investigate
the possibility of offering geo-reference
exploration services alongside the more
traditional methods for exploring
bibliographic metadata. The search
service leveraged REST-based services
for search and data retrieval, exposed via
OpenURLs and chained together as
needed via a workflow. So we developed
an OpenURL-addressable geo-reference
service that
could take a small result set (10-25 items),
retrieve and parse each record to
determine if it contained place
information, call geo-names to retrieve
coordinates for each place, and then
aggregate the geo-reference information
and return it as an XML marker file of the
same format as the WHO marker files
used for the RSS feed project. The subset
of results that contained geographic
information was then overlaid onto a map
and navigable using a web browser and
Google maps (Figure 3).
We leveraged the aDORe system to
perform the augmentation. For each item
in a result set, there was a unique identifier
that could be used to retrieve the DIDL
from the repository, which contained the
MARC XML record for that item. We
then used XPath statements to locate and
inspect the contents of a portion of the
MARC record that would contain geo-reference information. The first was the
MARC 034 field, which the MARC
specification identifies as the location for
"Coded Cartographic Mathematical
Data" (Library of Congress). Longitude
and latitude data can be incorporated into
the 034 field in subfields d and e
(longitude), f and g (latitude). An XPath
statement such as this one could be used
to retrieve coordinate data from the 034
field:
Statement syntax: //marc:datafield[@tag='034']/marc:subfield[@code='d'].
When this field/subfield combination
occurred in a MARC record for an item
in the result set, we could use it as is in
the geo-reference version of the results.
Another field of interest was the 651
field, retrievable from MARC XML
using this XPath statement:
Statement syntax: //marc:datafield[@tag='651']/marc:subfield[@code='a'].
When this field occurs, subfield "a"
will contain the place name(s)
associated with an item described by the
MARC metadata record. The contents
of subfield "a" within the 651 field were
what we would send off to the
geonames.org (GeoNames) place name
lookup service.
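Putting the two fields together, a record-inspection routine might look roughly like the sketch below. The MARCXML namespace is the standard Library of Congress one; the GeoNames query shown (the public searchJSON service with a username parameter) is an assumption that should be checked against current GeoNames documentation.

```python
# Illustrative sketch: pull coordinates from MARC 034, fall back to the 651$a
# place name and a GeoNames lookup. The namespace URI is the standard MARCXML
# one; the GeoNames call follows its public searchJSON web service but should
# be verified against current documentation. Requires the third-party 'lxml'.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

from lxml import etree

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def coordinates_for_record(marcxml_root, geonames_user="demo"):
    # 1. Explicit coded coordinates in 034 $d (longitude) and $f (latitude).
    lon = marcxml_root.xpath("//marc:datafield[@tag='034']/marc:subfield[@code='d']/text()", namespaces=NS)
    lat = marcxml_root.xpath("//marc:datafield[@tag='034']/marc:subfield[@code='f']/text()", namespaces=NS)
    if lon and lat:
        return lat[0], lon[0]
    # 2. Otherwise, send the 651 $a place name(s) to GeoNames.
    places = marcxml_root.xpath("//marc:datafield[@tag='651']/marc:subfield[@code='a']/text()", namespaces=NS)
    for place in places:
        params = urlencode({"q": place, "maxRows": 1, "username": geonames_user})
        with urlopen(f"http://api.geonames.org/searchJSON?{params}") as resp:
            hits = json.load(resp).get("geonames", [])
        if hits:
            return hits[0]["lat"], hits[0]["lng"]
    return None
```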
As expected, the geo-referenced
result set was typically a small subset of
the results that were displayed in
response to a query. Most bibliographic
records simply lacked usable
geographic coordinate data. End-user
geo-reference exploration was also an
uneven experience. It tended to be
highly dependent upon the topic of the
query – while many biological and
geophysical phenomena relate to
locations on the globe, chemistry and
physics content, for example, is far less
likely to have any relation to
geographic locations. As a result, a
query for "karst regions" (geological
formations affected by water) or
"disaster mitigation efforts" or
"migration routes for African antelope"
yields far better results than queries for
keywords like "neutrinos" or "viral
replication". We concluded
that managing user expectations, along
with the user’s own common sense,
would play a role in user satisfaction
with a geo-referenced view of
bibliographic metadata search results.
But we also believed that more work
was necessary to span the gap between
the explicit geographic information
included in the metadata, and geo-reference
information that was not explicitly tagged
as such.

Figure 3. Digital library-based approach for GAT
GAT: a proactive approach to assembling
topically focused collections
Meanwhile, our team was beginning
to explore how to deal with retrieving and
integrating metadata from various
sources, with a goal of offering a suite of
awareness tools that would expose these
data either via a traditional search work
flow, or in real time via automatically
generated queries based upon the user’s
context. The awareness tools would
retrieve and display data based upon
queries generated automatically from
text the user entered while they were composing
content, whether for a blog post,
discussion forum, email message, or
other purpose, especially for users who
would not have the time to explicitly
perform searches, such as in the case of
an emergency.
To maximize the utility of these
tools, we decided to create tools that
would enable rapid assembly of
topically focused collections, either by
harvesting content from a specific
topical repository, or by retrieving
results of a query against our own large
bibliographic metadata collection, and
storing those results as a discrete
searchable collection which could be
explored using the various tools we
would develop. We decided to map the
harvested metadata records into RDF
triples, and use a simple multi-layer
architecture to support their exploration.
A middleware service layer would allow
query and structured content retrieval,
but leave it to another application or
layer to handle rendering, while a
second service layer would combine
results with a rendering tool suitable for
end users. For the GAT, it made sense to
have the middleware layer be able to
return results in the Google maps XML
marker format, and the KML format
(Google) which is supported by Google
Earth (see Figure 4), and other
geospatial visualization tools.
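For reference, a single geo-encoded record rendered as KML might look like the fragment below; the title and coordinates are invented, but the Placemark/Point structure and the longitude,latitude coordinate order are what KML consumers such as Google Earth expect.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: one geo-encoded record rendered as KML.
     Title and coordinates are invented; KML uses longitude,latitude order. -->
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example record title</name>
      <description>Link back to the bibliographic record or abstract.</description>
      <Point>
        <coordinates>-106.30,35.88,0</coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>
```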
As the architecture (see Figure 5) and
toolset took shape, we returned to the
problem of how we might improve the
percentage of records that included
geo-reference metadata. We knew from
cursory inspection of the harvested
records that many more records had the
potential to be explored geospatially
than those that explicitly included
geo-reference data.

Figure 4. Google Earth View for GAT

Figure 5. Architecture for GAT: raw metadata is converted and augmented to support various geographic visualization techniques
Approach adopted to identify location
names from text sources
Two rich potential sources of place
information were titles and abstracts, but
finding locations in this data would
require sophisticated text processing
capabilities. We explored the Yahoo
term extraction service, but while it was
capable of sometimes identifying a
place name as a term of significance, it
was not able to tell us when a particular
piece of information was a place. We
had been exploring both the Yahoo
term extraction service and Calais, to
see if they improved automatic query
construction for the awareness
capabilities.
Calais, now OpenCalais, provided a
much more detailed analysis of content
provided to it. "OpenCalais uses natural
language processing (NLP) to read an
article, extracting the 'who, what, when,
where and how'" for textual content
(Thomas). One category of information
it was able to identify in free text content
was locations. Not only was it able to
identify locations as such, it also
returned the geographic coordinates for
the locations. We found that combining
location information explicitly provided
in the metadata we harvested with
geo-reference information identified by
calls to OpenCalais vastly improved the
utility of our GAT. Indeed, the number
of records augmented in this fashion
that included geo-reference data
increased by one or two orders of
magnitude, depending on the topic of
the search or the repository contents.
OpenCalais supports several
interfaces for submitting requests. We
chose to use the REST API and
submitted both the object title and its
abstract as the value for the content
parameter, which is the mechanism for
submitting text to Calais for parsing and
entity extraction. Calais requires a
minimum of 100 characters in order to
determine the content’s language and
have enough context in order to identify
entities such as people and places. The
upper limit on a content value is 100,000
characters, making it ideal for abstract
content analysis. The request API is easy
to use, though limited to 50,000 requests
per day. The response can be returned in
several formats. We elected to receive
RDF/XML responses, which we parsed
with XPath expressions to locate linked
data returned by the service. We used
the place data to build local geo-reference
triples that correspond to the coordinates
identified by Calais. Calais in turn relies
on other linked data services, such as
geo-names, for this data. It excels
at entity extraction, with a strong
emphasis on corporate data.
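A request along these lines might be sketched as follows; the endpoint URL, header names, and response handling are placeholders standing in for whatever the OpenCalais documentation specifies, not a verbatim reproduction of its API.

```python
# Illustrative sketch of submitting title + abstract to an entity-extraction
# REST API such as OpenCalais. The endpoint, API-key header, and response
# handling are placeholders; consult the service's documentation for the
# real parameters. Requires the third-party 'requests' library.
import requests

CALAIS_ENDPOINT = "https://api.example.org/calais"   # placeholder URL
API_KEY = "YOUR-API-KEY"

def extract_entities(title, abstract):
    text = f"{title}. {abstract}"
    if len(text) < 100:          # the service needs enough context to work with
        return None
    resp = requests.post(
        CALAIS_ENDPOINT,
        headers={"x-api-key": API_KEY, "Content-Type": "text/raw"},
        data=text.encode("utf-8"),
        timeout=30,
    )
    resp.raise_for_status()
    # The RDF/XML response would then be parsed (e.g. with XPath) to pull out
    # entities typed as locations, along with their latitude/longitude.
    return resp.text
```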
Discussion
Since our work involves fusion of
data from disparate sources, RDF is the
format to which we normalize all our
data. In the linked data web, it is possible,
indeed encouraged, that one reference
other linked data, rather than duplicating
effort. On the one hand, this makes sense,
because, as with the data normalization
problem in relational databases,
duplicating data may yield performance
advantages but it also leaves the door
open to inevitable data synchronization
problems. On the other hand, it may be
crucial that the coordinates for a
geographic location be explicitly
captured and stored locally in triples
associated with other aspects of the
metadata object. Geo-political lines are
notorious for being redrawn over time, so
while it might sound like a good idea to
link to a (DBpedia) record about
Copenhagen, Denmark, what happens if
the "Danes" redraw the lines of their
city? Is the appropriate geographic locale
the original boundaries of the city, or is it
sufficient to point to Copenhagen? And
this does not take into account additional
geographic dimensions such as elevation
or depth. There are plants, for example,
which occur within a very narrow
climatic band at a particular elevation on
a single mountain; this is not an
uncommon situation, and yet that is
crucial information for researchers.
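As one illustration of capturing coordinates locally, the sketch below builds such triples with rdflib and the W3C WGS84 vocabulary; the record URI, the label, the coordinate values, and the choice of predicates are ours for the example, not the GAT schema.

```python
# Illustrative sketch of local geo-reference triples using rdflib and the
# W3C WGS84 geo vocabulary. The subject URI, label, coordinates, and exact
# predicates are chosen for this example, not taken from the GAT schema.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, XSD

GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

g = Graph()
record = URIRef("http://example.org/record/12345")   # hypothetical record URI
g.add((record, RDFS.label, Literal("Amphibian survey, Calaveras County")))
g.add((record, GEO.lat, Literal("38.196", datatype=XSD.decimal)))
g.add((record, GEO.long, Literal("-120.680", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```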
Augmenting metadata is a tricky
proposition, but with the low number of
bibliographic records which have been
explicitly cataloged with geographic data,
it is essential if you want to support geo-reference querying and browsing of the
data. There are a couple of issues to keep
in mind, however. One is that the
granularity of geo-coding is still rather
low, even with Calais. Geo-coordinates
are typically provided as a single pair of
longitude and latitude coordinates,
representing a point on the globe, rather
than an actual outline of a region.
Ironically, the precise coordinates might,
in some cases, be sensitive information.
For example, the community of scientists
who study karst terrain often keep the
precise locations of cave formations
secret, either at the request of a private
land owner, or to prevent curious
spelunkers from visiting an ecologically
sensitive cave. Yet, the pinpoint
coordinates offered by geo-names or
OpenCalais are not always terribly
revealing. If, for example, a bibliographic
record describes a study of amphibians in
California, the coordinates returned for
"Calaveras County" will likely be the
center of that county. This may be enough
for some users, but not enough for others,
who might need a set of coordinates that
define an arbitrary, but significant region
of the globe, that is related to the object
being described by the metadata.
Conclusion
Many libraries have substantial
collections that would lend themselves to
geographic exploration, and end users are
becoming increasingly comfortable
dealing with data in the context of
physical places. In cases where the
metadata for these collections lacks
explicit geo-referenced data, augmentation
via entity extraction can fill this gap.
Libraries may choose to provide browse
and search capabilities in which results are
overlaid onto maps, but there are other
delivery strategies worth considering.
JITIR techniques allow geo-referenced
data to be retrieved and supplied to
location-aware mobile devices, for
example, in support of museum tours or
field research. In the context of e-science,
there is utility to offering services that
expose results as raw geo-encoded
metadata, since geo-encoded metadata
may be of more value to researchers
already equipped to deal with large-scale
data analysis, including analysis of
geo-encoded data.
REFERENCES
DBpedia (n.d.), available at: http://dbpedia.org/About

GeoNames (n.d.), available at: www.geonames.org/

Google (n.d.), Keyhole Markup Language Developer's Guide, available at: http://code.google.com/apis/kml/documentation/topicsinkml.html

Library of Congress (n.d.), "034 – coded cartographic mathematical data", available at: www.loc.gov/marc/bibliographic/concise/bd034.html

OpenCalais (n.d.), available at: www.opencalais.com/

Petras, V. (n.d.), "Statistical analysis of geographic and language clues in the MARC record", available at: http://metadata.sims.berkeley.edu/papers/Marcplaces.pdf

Rhodes, B. and Maes, P. (2000), "Just-in-time information retrieval agents", IBM Systems Journal, Vol. 39 Nos 3/4, pp. 685-704.

Sompel, H., Bekaert, J., Liu, X., Balakireva, L. and Schwander, T. (2005), "aDORe: a modular, standards-based digital object repository", The Computer Journal, Vol. 48 No. 5, pp. 514-35, available at: http://arxiv.org/abs/cs.DL/0502028

Thomas, K. (n.d.), "Thomson Reuters OpenCalais service adopted by the Huffington Post, DailyMe and Associated Newspapers", available at: www.opencalais.com/press-releases/thomson-reuters-opencalais-service-adopted-huffington-post-dailymeand-associated-new
James Powell (jepowell@lanl.gov)
is a Research Technologist at the
Research Library, Los Alamos National
Laboratory, Los Alamos, New Mexico,
USA.
Ketan Mane (kmane@renci.org) is a
Senior Research Informatics Developer
at the Renaissance Computing Institute
(RENCI), University of North Carolina,
Chapel Hill, North Carolina, USA.
Linn Marks Collins (linn@lanl.gov),
Mark L.B. Martinez (mlbm@lanl.gov)
and Tamara McMahon (tmcmahon@lanl.gov) are based at the Research
Library, Los Alamos National Laboratory,
Los Alamos, New Mexico, USA.