Library Hi Tech News

To cite this document: James Powell, Ketan Mane, Linn Marks Collins, Mark L.B. Martinez and Tamara McMahon (2010), "The Geographic Awareness Tool: techniques for geo-encoding digital library content", Library Hi Tech News, Vol. 27 No. 9, pp. 5-9. Permanent link to this document: http://dx.doi.org/10.1108/07419051011110586
The Geographic Awareness Tool: techniques for geo-encoding digital library content

James Powell, Ketan Mane, Linn Marks Collins, Mark L.B. Martinez and Tamara McMahon

Introduction

Geo-encoding and geo-querying bibliographic metadata are still in their infancy. There are many factors to consider when designing a geo-reference interface for library data. One is appropriateness: is geo-encoding likely to yield any additional information for a user? Another is discovery: how do you determine what geographic locations are associated with the intellectual content of a particular digital object? Another is when, or whether, to use augmentation: are there services which can help you increase the amount of geo-encoded data you can provide, and is the granularity with which they can describe a geographic region sufficient for your purposes? Still another is the interface: what tools will you offer for exploring geo-encoded metadata, and what data formats do they support? And most importantly, what is the motivation for adding geo-reference capabilities?

Motivation

Geo-location capabilities are becoming ubiquitous and increasingly sophisticated. A steady stream of consumer electronics is raising geo-location literacy among the public. Travelers, hikers, and hobbyists use augmented digital maps which overlay geo-encoded information.
These digital map tools are supplied by navigation systems, cell phones, and web sites. Geo-encoding is a technique increasingly used to enable augmented reality applications and geographic searching of data sets that include geo-reference data. For example, popular smart phones based on the Android and Apple iOS operating systems offer geo-enabled applications. In their simplest form, the handset hardware combines a global positioning system sensor with an electronic compass, which enables applications to know where the user is and which direction they are facing. If you have ever tried to use a paper map without a compass, then you have had the experience of spinning the map around trying to line it up with the terrain or a curve in the road ahead. With a device that knows both where you are and what direction you are pointed, applications can generate map views that correspond to your location and heading. And with data from services such as Google Street View, a location-enabled application can show you a view of what you should see at human eye level, which changes as your orientation changes. Overlays of geo-referenced data, that is, any metadata which includes a geographic component, can show who or what is nearby. In a social context, such as Facebook Places, users share their location with each other; in the larger geographic context, various users and entities supply geo-encoded data that are then overlaid onto a map at a user's request. It is this last capability which libraries can leverage, by ferreting out geospatial data in metadata or full-text content and generating compatible output for these new systems.

Geo-referenced metadata

A growing number of digital library projects are experimenting with geo-referenced data to take advantage of the ubiquity and popularity of geographic services.
Geo-referenced metadata provides users with another way of exploring library collections, and is even more useful when an increasing amount of the content is also available digitally. There are a number of challenges with offering this service: surveys show that at best, only about 35 percent of metadata records contain geo-reference data indicating the place(s) associated with an item (Petras). Although the MARC bibliographic format includes fields for specifying this information, geo-reference data, if present at all, often end up in subject fields, or are available but opaque, buried in the abstract. We have been experimenting with several approaches for both on-the-fly geo-reference processing and batch geo-reference augmentation of harvested metadata content. Our various implementations of what we call the "Geographic Awareness Tool" (GAT) share some basic characteristics. First, they support end-user searching. One notable version of the GAT implemented a JITIR (just-in-time information retrieval) technique (Rhodes and Maes, 2000). With a JITIR search tool, users do not need to explicitly perform a search; instead, queries are derived from their context, formulated and executed on the fly. For example, if a user is composing a blog post, the text they enter is used to generate queries. Users search the textual portion of the geo-encoded content, such as titles or descriptions, just as they would search a library catalog. Second, the GAT incarnations incorporate externally geo-encoded data: whether RSS newsfeeds or bibliographic data, the data have been previously harvested and augmented, through various techniques, with geo-coordinates that relate to the textual content. Finally, the GAT overlays this geo-encoded data onto a map using Web 2.0 technologies, including AJAX (see Figure 1).
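The JITIR idea of deriving a query from the text a user is composing can be sketched very simply. The stopword list and frequency heuristic below are illustrative assumptions, not the GAT's actual query builder:

```python
import re
from collections import Counter

# Minimal stop list for illustration; a production JITIR tool
# would use a much fuller list and better term weighting.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is",
             "are", "for", "with", "that", "this", "was", "were", "have"}

def derive_query(context_text, max_terms=3):
    """Derive a search query from text the user is composing,
    using the most frequent non-stopword terms."""
    words = re.findall(r"[a-z]+", context_text.lower())
    terms = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return " ".join(t for t, _ in Counter(terms).most_common(max_terms))

draft = ("Avian flu outbreaks in Vietnam have prompted avian "
         "surveillance programs. The flu virus H5N1 spreads among birds.")
print(derive_query(draft))
```

Each time the draft changes, the derived query can be re-executed silently in the background, which is what lets a JITIR tool surface results without the user ever typing into a search box.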
Methodology and approach

This section provides detailed descriptions of the different approaches that we used to parse and collect data, and to discover location information (Figure 1 gives an overview of the GAT setup for overlaying markers on the map).

Using an RSS feed-based approach

To test the feasibility of the project, we decided to initially design our prototype around an RSS feed for avian flu. The World Health Organization (WHO) is tasked by the United Nations to monitor disease outbreaks, especially infectious diseases with the potential to spread across geo-political boundaries. As part of this mission, the WHO provides RSS newsfeeds to the public, which contain information about potential threats. In 2007, we began to pursue two geographically augmented information retrieval projects to evaluate the potential of Google Maps to provide access to RSS feeds, and to search results from a metadata search service. A sampling of RSS feeds regarding reports of avian flu from the WHO web site was gathered and parsed for location information. At that time, many experts considered the avian flu virus, H5N1, a top candidate for leaping across the species barrier and initiating a major global flu pandemic. Each WHO newsfeed item was mapped to geographic coordinates using the Google Maps API. This process generated an XML marker file, which is a simple format for specifying a location and associated display information. Google Maps interprets the XML marker file to generate marker overlays onto its views (map, satellite, or earth). The marker file was hosted on a web server and could be periodically regenerated as new data arrived in the WHO feed (Figure 2).
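The feed-to-marker-file pipeline can be sketched with the standard library. The hard-coded gazetteer stands in for the geocoding lookup the project performed via the Google Maps API, and the marker attribute names (lat, lng, label) are illustrative, not a fixed standard:

```python
import xml.etree.ElementTree as ET

# Tiny illustrative gazetteer; the actual prototype resolved place
# names to coordinates with a geocoding service, not a lookup table.
GAZETTEER = {"Indonesia": (-2.5, 118.0), "Viet Nam": (16.0, 108.0)}

def rss_to_markers(rss_xml):
    """Parse an RSS feed, match item titles against known place names,
    and emit a simple XML marker file for map overlay."""
    markers = ET.Element("markers")
    for item in ET.fromstring(rss_xml).iter("item"):
        title = item.findtext("title", default="")
        for place, (lat, lng) in GAZETTEER.items():
            if place in title:
                ET.SubElement(markers, "marker",
                              lat=str(lat), lng=str(lng), label=title)
    return ET.tostring(markers, encoding="unicode")

feed = """<rss><channel>
  <item><title>Avian influenza - situation in Indonesia</title></item>
  <item><title>Avian influenza - situation in Viet Nam</title></item>
</channel></rss>"""
print(rss_to_markers(feed))
```

Regenerating this file on a schedule, as the article describes, keeps the overlay current as new items arrive in the feed.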
Using a digital library-based approach

At the same time, the Los Alamos National Laboratory (LANL) Research Library was undertaking a major initiative to create a new service exposing its collection of 93 million bibliographic metadata records, which describe peer-reviewed scientific papers from around the world. This content is locally loaded into a digital repository called aDORe (Sompel et al., 2005). The LANL aDORe repository is a high-performance, scalable architecture for managing distributed repositories of metadata. Records from various vendors are normalized and mapped into a standard MARC XML format, and these records are then placed in an MPEG-21 DIDL wrapper. Collections of DIDL records are stored in XML tape files; each XML tape functions as a distinct repository, and this content may be harvested using the Open Archives Initiative Protocol for Metadata Harvesting. Individual records can be retrieved from tapes via the aDORe disseminator component. The architecture is massively scalable, yet offers performance that enables retrieval of individual records from the repository in real time. Our large collection, high-performance repository, and the institution's strong physical sciences leanings provided opportunity and incentive to investigate the possibility of offering geo-reference exploration services alongside the more traditional methods for exploring bibliographic metadata. The search service leveraged REST-based services for search and data retrieval, exposed via OpenURLs, chained together as needed via a work flow. So we developed an OpenURL-addressable geo-reference service that could take a small result set (10-25 items), retrieve and parse each record to determine if it contained place information, call GeoNames to retrieve coordinates for each place, and then aggregate the geo-reference information and return it as an XML marker file of the same format as the WHO marker files used for the RSS feed project (Figure 2 illustrates the RSS feed-based approach for the GAT). The subset of results that contained geographic information was then overlaid onto a map and navigable using a web browser and Google Maps (Figure 3). We leveraged the aDORe system to perform the augmentation. For each item in a result set, there was a unique identifier that could be used to retrieve the DIDL from the repository, which contained the MARC XML record for that item. We then used XPath statements to locate and inspect the contents of the portions of the MARC record that would contain geo-reference information. The first was the MARC 034 field, which the MARC specification identifies as the location for "Coded Cartographic Mathematical Data" (Library of Congress). Longitude and latitude data can be incorporated into the 034 field in subfields d and e (longitude) and f and g (latitude). An XPath statement such as this one could be used to retrieve coordinate data from the 034 field: //marc:datafield[@tag='034']/marc:subfield[@code='d']. When this field/subfield combination occurred in a MARC record for an item in the result set, we could use it as-is in the geo-referenced version of the results. Another field of interest was the 651 field, retrievable from MARC XML using this XPath statement: //marc:datafield[@tag='651']/marc:subfield[@code='a']. When this field occurs, subfield "a" will contain the place name(s) associated with the item described by the MARC metadata record. The contents of subfield "a" within the 651 field were what we would send to the geonames.org (GeoNames) place name lookup service.
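The 034/651 extraction logic can be sketched with Python's standard library. The sample record is invented for illustration; the actual GAT operated on DIDL-wrapped records retrieved from aDORe:

```python
import xml.etree.ElementTree as ET

# MARC XML namespace as published by the Library of Congress.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def extract_georeference(record_xml):
    """Return coordinates from the 034 field (subfields d-g) if present,
    otherwise place names from 651 subfield a."""
    record = ET.fromstring(record_xml)
    coords = [sf.text for sf in record.findall(
                  ".//marc:datafield[@tag='034']/marc:subfield", NS)
              if sf.get("code") in ("d", "e", "f", "g")]
    if coords:
        return {"coordinates": coords}
    places = [sf.text for sf in record.findall(
        ".//marc:datafield[@tag='651']/marc:subfield[@code='a']", NS)]
    return {"places": places}  # names to be resolved via GeoNames

rec = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="651"><subfield code="a">New Mexico</subfield></datafield>
</record>"""
print(extract_georeference(rec))  # → {'places': ['New Mexico']}
```

Records whose 034 field already carries coordinates skip the GeoNames round trip entirely, which is why the service checked that field first.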
As expected, the geo-referenced result set was typically a small subset of the results displayed in response to a query. Most bibliographic records simply lacked usable geographic coordinate data. End-user geo-reference exploration was also an uneven experience. It tended to be highly dependent upon the topic of the query: while many biological and geophysical phenomena relate to locations on the globe, chemistry and physics content, for example, is far less likely to have any relation to geographic locations. As a result, a query for "karst regions" (geological formations affected by water), "disaster mitigation efforts", or "migration routes for African antelope" yields far better results than queries for keywords like "neutrinos" or "viral replication". We concluded that managing user expectations, along with the user's own common sense, would play a role in user satisfaction with a geo-referenced view of bibliographic metadata search results. But we also believed that more work was necessary to span the gap between the explicit geographic information included in the metadata and geo-reference information that was not explicitly tagged as such (Figure 3 illustrates the digital library-based approach for the GAT).

GAT: a proactive approach to assembling topically focused collections

Meanwhile, our team was beginning to explore how to deal with retrieving and integrating metadata from various sources, with the goal of offering a suite of awareness tools that would expose these data either via a traditional search work flow, or in real time via automatically generated queries based upon the user's context.
The awareness tools would retrieve and display data based upon automatic queries generated from text the user entered while composing content, whether for a blog post, discussion forum, email message, or other purpose; this is especially valuable for users who do not have the time to explicitly perform searches, such as during an emergency. To maximize the utility of these tools, we decided to create tools that would enable rapid assembly of topically focused collections, either by harvesting content from a specific topical repository, or by retrieving the results of a query against our own large bibliographic metadata collection and storing those results as a discrete searchable collection which could be explored using the various tools we would develop. We decided to map the harvested metadata records into RDF triples, and to use a simple multi-layer architecture to support their exploration. A middleware service layer would allow query and structured content retrieval, but leave it to another application or layer to handle rendering, while a second service layer would combine results with a rendering tool suitable for end users. For the GAT, it made sense for the middleware layer to be able to return results in the Google Maps XML marker format and in the KML format (Google), which is supported by Google Earth (Figure 4 shows the Google Earth view for the GAT) and other geospatial visualization tools. As the architecture (see Figure 5, in which raw metadata is converted and augmented to support various geographic visualization techniques) and toolset took shape, we returned to the problem of how we might improve the percentage of records that included geo-reference metadata. We knew from cursory inspection of the harvested records that many more records had the potential to be explored geospatially than those that explicitly included geo-reference data.
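A middleware layer that renders results as KML can be quite small. The sketch below emits a minimal KML document of the kind Google Earth accepts; the input tuples and titles are invented for illustration:

```python
import xml.etree.ElementTree as ET

def to_kml(items):
    """Render (title, lat, lon) tuples as a minimal KML document
    openable in Google Earth and other geospatial tools."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for title, lat, lon in items:
        pm = ET.SubElement(doc, "Placemark")
        ET.SubElement(pm, "name").text = title
        point = ET.SubElement(pm, "Point")
        # KML coordinate order is longitude,latitude
        ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

print(to_kml([("Karst survey, Calaveras County", 38.2, -120.6)]))
```

Because KML and the Google Maps marker format carry essentially the same point data, the middleware can generate both from one internal result list and leave format choice to the caller.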
Approach adopted to identify location names from text sources

Two rich potential sources of place information were titles and abstracts, but finding locations in these data would require sophisticated text processing capabilities. We explored the Yahoo term extraction service, but while it was sometimes capable of identifying a place name as a term of significance, it was not able to tell us when a particular piece of information was a place. We had been exploring both the Yahoo term extraction service and Calais to see if they improved automatic query construction for the awareness capabilities. Calais, now OpenCalais, provided a much more detailed analysis of the content provided to it. "OpenCalais uses natural language processing (NLP) to read an article, extracting the 'who, what, when, where and how'" for textual content (Thomas). One category of information it was able to identify in free text was locations. Not only was it able to identify locations as such, it also returned the geographic coordinates for those locations. We found that combining location information explicitly provided in the metadata we harvested with geo-reference information identified by calls to OpenCalais vastly improved the utility of our GAT. Indeed, the number of augmented records that included geo-reference data increased by one or two orders of magnitude, depending on the topic of the search or repository contents. OpenCalais supports several interfaces for submitting requests. We chose to use the REST API and submitted both the object title and its abstract as the value for the content parameter, which is the mechanism for submitting text to Calais for parsing and entity extraction. Calais requires a minimum of 100 characters in order to determine the content's language and to have enough context to identify entities such as people and places.
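A request to such a REST entity-extraction endpoint might be prepared as below. The endpoint URL and the auth parameter name are placeholders (the Calais/OpenCalais interface has changed over time; consult the current service documentation); only the content parameter and the 100-character minimum come from the workflow described above:

```python
from urllib import parse, request

# Placeholder endpoint: NOT the real OpenCalais URL.
CALAIS_URL = "https://api.example.com/calais/enlighten/rest"

def build_calais_request(title, abstract, api_key):
    """Prepare a POST submitting title + abstract as the 'content'
    parameter. The service needs at least 100 characters to detect
    the content's language and accepts at most 100,000."""
    content = f"{title}. {abstract}"
    if not 100 <= len(content) <= 100_000:
        raise ValueError("content must be 100-100,000 characters")
    data = parse.urlencode({
        "content": content,
        "licenseID": api_key,  # auth parameter name is a placeholder
    }).encode()
    return request.Request(CALAIS_URL, data=data, method="POST")
```

The request object would then be sent with `urllib.request.urlopen` and the RDF/XML response parsed as described in the next section; rate limiting (50,000 requests per day) argues for batching and caching responses locally.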
The upper limit on a content value is 100,000 characters, making it ideal for abstract content analysis. The request API is easy to use, though limited to 50,000 requests per day. The response can be returned in several formats. We elected to receive RDF/XML responses, which we parsed with XPath expressions to locate the linked data returned by the service. We used the place data to build local geo-reference triples that correspond to the coordinates identified by Calais. Calais in turn relies on other linked data services, such as GeoNames, for these data. It excels at entity extraction, with a strong emphasis on corporate data.

Discussion

Since our work involves fusion of data from disparate sources, RDF is the format to which we normalize all our data. In the linked data web, it is possible, indeed encouraged, to reference other linked data rather than duplicating effort. On the one hand, this makes sense because, as with the data normalization problem in relational databases, duplicating data may yield performance advantages but also leaves the door open to inevitable data synchronization problems. On the other hand, it may be crucial that the coordinates for a geographic location be explicitly captured and stored locally in triples associated with other aspects of the metadata object. Geo-political lines are notorious for being redrawn over time, so while it might sound like a good idea to link to a (DBpedia) record about Copenhagen, Denmark, what happens if the Danes redraw the lines of their city? Is the appropriate geographic locale the original boundaries of the city, or is it sufficient to point to Copenhagen? And this does not take into account additional geographic dimensions such as elevation or depth.
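Building local geo-reference triples from the extracted place data can be as simple as the sketch below, which uses the W3C WGS84 vocabulary; the subject URI and coordinates are invented for illustration:

```python
# W3C Basic Geo (WGS84) vocabulary namespace.
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

def georeference_triples(subject_uri, lat, lon):
    """Build local triples (as N-Triples lines) pinning extracted
    coordinates to our own metadata object, so the data survive even
    if the upstream linked-data record changes."""
    return [
        f'<{subject_uri}> <{GEO}lat> "{lat}" .',
        f'<{subject_uri}> <{GEO}long> "{lon}" .',
    ]

# Hypothetical record URI; a real deployment would mint its own.
for t in georeference_triples("http://example.org/record/42", 35.88, -106.30):
    print(t)
```

Storing the coordinates in local triples, rather than only linking out, is exactly the trade-off the Discussion weighs: a little duplication in exchange for stability when the linked record moves or its boundaries are redrawn.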
There are plants, for example, which occur within a very narrow climatic band at a particular elevation on a single mountain; this is not an uncommon situation, and that is crucial information for researchers. Augmenting metadata is a tricky proposition, but given the low number of bibliographic records which have been explicitly cataloged with geographic data, it is essential if you want to support geo-reference querying and browsing of the data. There are a couple of issues to keep in mind, however. One is that the granularity of geo-coding is still rather low, even with Calais. Geo-coordinates are typically provided as a single pair of longitude and latitude values, representing a point on the globe, rather than an actual outline of a region. Ironically, precise coordinates might, in some cases, be sensitive information. For example, the community of scientists who study karst terrain often keep the precise locations of cave formations secret, either at the request of a private land owner, or to prevent curious spelunkers from visiting an ecologically sensitive cave. Yet the pinpoint coordinates offered by GeoNames or OpenCalais are not always terribly revealing. If, for example, a bibliographic record describes a study of amphibians in California, the coordinates returned for "Calaveras County" will likely be the center of that county. This may be enough for some users, but not for others, who might need a set of coordinates that define an arbitrary but significant region of the globe related to the object being described by the metadata.

Conclusion

Many libraries have substantial collections that would lend themselves to geographic exploration, and end users are becoming increasingly comfortable dealing with data in the context of physical places. In cases where the metadata for these collections lacks explicit geo-reference data, augmentation via entity extraction can fill this gap.
Libraries may choose to provide browse and search capabilities in which results are overlaid onto maps, but there are other delivery strategies worth considering. JITIR techniques allow geo-referenced data to be retrieved and supplied to location-aware mobile devices, for example, in support of museum tours or field research. In the context of e-science, there is utility in offering services that expose results as raw geo-encoded metadata, since such metadata may be of more value to researchers already equipped to deal with large-scale data analysis, including analysis of geo-encoded data.

REFERENCES

DBpedia (n.d.), available at: http://dbpedia.org/About

GeoNames (n.d.), available at: www.geonames.org/

Google (n.d.), Keyhole Markup Language Developer's Guide, available at: http://code.google.com/apis/kml/documentation/topicsinkml.html

Library of Congress (n.d.), "034 - coded cartographic mathematical data", available at: www.loc.gov/marc/bibliographic/concise/bd034.html

OpenCalais (n.d.), available at: www.opencalais.com/

Petras, V. (n.d.), "Statistical analysis of geographic and language clues in the MARC record", available at: http://metadata.sims.berkeley.edu/papers/Marcplaces.pdf

Rhodes, B. and Maes, P. (2000), "Just-in-time information retrieval agents", IBM Systems Journal, Vol. 39 Nos 3/4, pp. 685-704.

Sompel, H., Bekaert, J., Liu, X., Balakireva, L. and Schwander, T. (2005), "aDORe: a modular, standards-based digital object repository", The Computer Journal, Vol. 48 No. 5, pp. 514-35, available at: http://arxiv.org/abs/cs.DL/0502028

Thomas, K. (n.d.), "Thomson Reuters OpenCalais service adopted by the Huffington Post, DailyMe and Associated Newspapers", available at: www.opencalais.com/press-releases/thomson-reuters-opencalais-service-adopted-huffington-post-dailyme-and-associated-new

James Powell (jepowell@lanl.gov) is a Research Technologist at the Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, USA.
Ketan Mane (kmane@renci.org) is a Senior Research Informatics Developer at the Renaissance Computing Institute (RENCI), University of North Carolina, Chapel Hill, North Carolina, USA. Linn Marks Collins (linn@lanl.gov), Mark L.B. Martinez (mlbm@lanl.gov) and Tamara McMahon (tmcmahon@lanl.gov) are based at the Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, USA.