CS/INFO 430 Information Retrieval

Lecture 21

Metadata 3

Course Administration

Automatic extraction of catalog data

Strategies
• Manual by trained cataloguers and indexers - high quality records, but expensive and time consuming
• Entirely automatic - fast, almost zero cost, but poor quality
• Automatic, followed by human editing - cost and quality depend on the amount of editing
• Manual collection level record, automatic item level record - moderate quality, moderate cost

DC-dot

DC-dot is a Dublin Core metadata editor for Web pages, created by Andy Powell at UKOLN.
http://www.ukoln.ac.uk/metadata/dcdot/

DC-dot has two parts:
(a) A skeleton Dublin Core record is created automatically from clues in the web page.
(b) A user interface is provided for cataloguers to edit the record.

Automatic record for CS 430 home page

DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/

DC.Title CS 430: Information Discovery
DC.Subject wya@cs.cornell.edu; Course Structure; Readings and references; Slides; Basic Information; William Y.
Arms; Information Retrieval Data Structures and Algorithms; cs430@cs.cornell.edu; Assignments; Syllabus; Text Book; Laptop computers; Assumed Background; Nomadic Computing Experiment; Notices; Course Description; Code of practice; Assignments and Grading; Last changed: February 6, 2001

Automatic record for CS 430 home page (continued)

DC.Publisher Cornell University
DC.Date qualifier W3CDTF content 2001-02-07
DC.Type qualifier DCMIType content Text
DC.Format text/html
DC.Format 5781 bytes
DC.Identifier http://www.cs.cornell.edu/courses/cs430/2001sp/

Observations on DC-dot applied to CS430 home page

DC.Title is a copy of the html <title> field
DC.Publisher is the owner of the IP address where the page was stored
DC.Subject is a list of headings and noun phrases presented for editing
DC.Date is taken from the Last-Modified field in the http header
DC.Type and DC.Format are taken from the MIME type of the http response
DC.Identifier was supplied by the user as input

Observations on DC-dot applied to George W. Bush home page

The home page has several meta tags:

<META NAME="TITLE" CONTENT="George W. Bush for President">
[The page has no html <title>]
<META NAME="CONTACT" CONTENT="George W Bush Campaign, P. O. Box 1902, Austin, TX 78767, Phone: (512) 6372000">
<META NAME="DESCRIPTION" CONTENT="George W. Bush is running for President of the United States to keep the country prosperous.">
<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, President, republican, 2000 election and more

Automatic record for George W. Bush home page

DC-dot applied to http://www.georgewbush.com/

DC.Subject George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government
DC.Description George W.
Bush is running for President of the United States to keep the country prosperous.

Automatic record for George W. Bush home page (continued)

DC.Publisher Concentric Network Corporation
DC.Date qualifier W3CDTF content 2001-01-12
DC.Type qualifier DCMIType content Text
DC.Format text/html
DC.Format 12223 bytes
DC.Identifier http://www.georgewbush.com/

Metadata extracted automatically by DC-dot

D.C. Field   Qualifier   Content
title                    Digital Libraries and the Problem of Purpose
subject                  not included in this slide
publisher                Corporation for National Research Initiatives
date         W3CDTF      2000-05-11
type         DCMIType    Text
format                   text/html
format                   27718 bytes
identifier               http://www.dlib.org/dlib/january00/01levy.html

Use of Machine Learning for Automatic Metadata Extraction

In the previous example, several fields that a human identifies easily were not recognized, e.g., the author and information about the serial in which the article appeared.

iVia is the most advanced of several systems that apply machine learning algorithms to extract metadata automatically. See: http://ivia.ucr.edu/projects/.

The state-of-the-art is:
• automatic extraction followed by human editing
• expert guided Web crawling.

Automatic Extraction and Search Engine Spam

D-Lib Magazine
Web pages created for users, with good quality control and no attempt to impress search engines. (The editor originally trained as a librarian.) The site lends itself to automatic indexing.

Political Web Sites (Bush and Gore)
Web pages created for marketing, with little consistency, designed to impress search engines. (The editors are specialists in public relations.) The sites are difficult to index automatically.

Metatest

Metatest is a research project led by Liz Liddy at Syracuse with participation from the Human Computer Interaction group at Cornell.
The aim is to compare the effectiveness as perceived by the user of indexing based on:
(a) Manually created Dublin Core
(b) Automatically created Dublin Core (higher quality than DC-dot)
(c) Full text indexing

Preliminary results suggest remarkably little difference in effectiveness.

Non-textual Materials: Examples

Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.

Possible Approaches to Information Discovery for Non-text Materials

Human indexing
    Manually created metadata records
Automated information retrieval
    Automatically created metadata records (e.g., image recognition)
    Context: associated text, links, etc. (e.g., Google image search)
    Multimodal: combine information from several sources
User expertise
    Browsing: user interface design

Catalog Records for Non-Textual Materials

• General metadata standards, such as Dublin Core and MARC, can be used to create a textual catalog record of nontextual items.
• Subject based metadata standards apply to specific categories of materials, e.g., FGDC for geospatial materials.

Text-based searching methods can be used to search these catalog records.

Example 1: Photographs

Photographs in the Library of Congress's American Memory collections

In American Memory, each photograph is described by a MARC record.

The photographs are grouped into collections, e.g., The Northern Great Plains, 1880-1920: Photographs from the Fred Hultstrand and F.A.
Pazandak Photograph Collections

Information discovery is by:
• searching the catalog records
• browsing the collections

Photographs: Cataloguing Difficulties

Automatic
• Image recognition methods are very primitive

Manual
• Photographic collections can be very large
• Many photographs may show the same subject
• Photographs have little or no internal metadata (no title page)
• The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)

Photographs: Difficulties for Users

Searching
• Often difficult to narrow the selection down by searching - browsing is required
• Criteria may be different from those in catalog (e.g., graphical characteristics)

Browsing
• Offline. Handling many photographs is tedious. Photographs can be damaged by repeated handling.
• Online. Viewing many images can be tedious. Screen quality may be inadequate.

Example 2: Geospatial Information

Example: Alexandria Digital Library at the University of California, Santa Barbara
• Funded by the NSF Digital Libraries Initiative since 1994.
• Collections include any data referenced by a geographical footprint: terrestrial maps, aerial and satellite photographs, astronomical maps, databases, related textual information
• Program of research with practical implementation at the university's map library

Alexandria User Interface

Alexandria: Computer Systems and User Interfaces

Computer systems
• Digitized maps and geospatial information -- large files
• Wavelets provide multi-level decomposition of image
  -> first level is a small coarse image
  -> extra levels provide greater detail

User interfaces
• Small size of computer displays
• Slow performance of Internet in delivering large files
  -> retain state throughout a session

Alexandria: Information Discovery

Metadata for information discovery

Coverage: geographical area covered, such as the city of Santa Barbara or the Pacific Ocean.
Scope: varieties of information, such as topographical features, political boundaries, or population density.

Latitude and longitude provide basic metadata for maps and for geographical features.

Gazetteer

Gazetteer: database and a set of procedures that translate representations of geospatial references: place names, geographic features, coordinates, postal codes, census tracts

Search engine tailored to peculiarities of searching for place names.

Research is making steady progress at feature extraction, using automatic programs to identify objects in aerial photographs or printed maps -- topic for long-term research.

Gazetteers

The Alexandria Digital Library (ADL): geolibrary at University of California at Santa Barbara where a primary attribute of objects is location on Earth (e.g., map, satellite photograph).

Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.

Gazetteer: list of geographic names, with geographic locations and other descriptive information.

Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)

Use of a Gazetteer

• Answers the "Where is" question; for example, "Where is Santa Barbara?"
• Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects.
• Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
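
The gazetteer uses listed above (name-to-location translation, and finding features of given types inside a drawn box) can be sketched with bounding-box footprints. This is only an illustration, not the ADL implementation: the `BoundingBox` and `Feature` classes, the function names, the sample entries, and their coordinates are all hypothetical, and the coordinates are rough values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Footprint as a latitude/longitude box in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

    def overlaps(self, other: "BoundingBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        # Boxes that cross the 180-degree meridian are not handled in this sketch.
        return not (
            self.north < other.south or other.north < self.south
            or self.east < other.west or other.east < self.west
        )


@dataclass(frozen=True)
class Feature:
    name: str
    feature_type: str
    footprint: BoundingBox


def point(lat: float, lon: float) -> BoundingBox:
    """A point footprint, represented as a zero-area box."""
    return BoundingBox(lat, lon, lat, lon)


# Hypothetical gazetteer entries with illustrative coordinates.
GAZETTEER = [
    Feature("Santa Barbara", "populated place",
            BoundingBox(34.39, -119.80, 34.46, -119.63)),
    Feature("Example Hospital", "hospital", point(34.43, -119.72)),
    Feature("Example School", "school", point(34.42, -119.70)),
    Feature("Example Lake", "lake", BoundingBox(36.10, -96.00, 36.12, -95.98)),
]


def where_is(name: str) -> list[BoundingBox]:
    """Translate a geographic name into its footprint(s)."""
    return [f.footprint for f in GAZETTEER if f.name.lower() == name.lower()]


def features_in(area: BoundingBox, feature_types: set[str]) -> list[Feature]:
    """Find features of the requested types whose footprint overlaps the area."""
    return [
        f for f in GAZETTEER
        if f.feature_type in feature_types and f.footprint.overlaps(area)
    ]


area = where_is("Santa Barbara")[0]
found = features_in(area, {"school", "hospital", "lake"})
print([f.name for f in found])  # ['Example Hospital', 'Example School']
```

The overlap test is deliberately crude: it inherits the weakness of bounding boxes, which can cover more territory than the feature they describe.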

Alexandria Gazetteer: Example from a search on "Tulsa"

Feature name                               State   County   Type      Latitude   Longitude
Tulsa                                      OK      Tulsa    pop pl    360914N    0955933W
Tulsa Country Club                         OK      Osage    locale    360958N    0960012W
Tulsa County                               OK      Tulsa    civil     360600N    0955400W
Tulsa Helicopters Incorporated Heliport    OK      Tulsa    airport   360500N    0955205W

Challenges for the Alexandria Gazetteer

Content standard: A standard conceptual schema for gazetteer information.

Feature types: A type scheme to categorize individual features that is rich in term variants and extensible.

Temporal aspects: Geographic names and attributes change through time.

"Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).

Challenges for the Alexandria Gazetteer (continued)

Quality aspects:
(a) Indicate the accuracy of latitude and longitude data.
(b) Ensure that the reported coordinates agree with the other elements of the description.

Spatial extents:
(a) Points do not represent the extent of the geographic locations and are therefore only minimally useful.
(b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).

Alexandria Thesaurus: Example

canals
A feature type category for places such as the Erie Canal.

Used for: The category canals is used instead of any of the following.
canal bends; ditches; canalized streams; drainage canals; ditch mouths; drainage ditches

Broader Terms: Canals is a sub-type of hydrographic structures.

... more ...

Alexandria Thesaurus: Example (continued)

canals (continued)

Related Terms: The following is a list of other categories related to canals (nonhierarchical relationships).
channels; locks; transportation features; tunnels

Scope Note: Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power. (Definition of canals.)

Alexandria Gazetteer

Alexandria Digital Library

Linda L.
Hill, James Frew, and Qi Zheng, Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library. D-Lib Magazine, 5(1), January 1999.
http://www.dlib.org/dlib/january99/hill/01hill.html
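
The latitude and longitude values in the Tulsa gazetteer example (e.g., 360914N 0955933W) appear to be packed degrees, minutes, and seconds followed by a hemisphere letter. Assuming that reading, the sketch below converts such a value to signed decimal degrees, the form most mapping software expects. The function name `dms_to_decimal` is hypothetical and the format interpretation is an assumption, not something the slides state.

```python
def dms_to_decimal(packed: str) -> float:
    """Convert packed DMS such as '360914N' or '0955933W' to decimal degrees.

    Assumed layout: DDMMSS + N/S for latitude, DDDMMSS + E/W for longitude.
    South and west are returned as negative values.
    """
    if len(packed) < 2:
        raise ValueError(f"not a packed DMS value: {packed!r}")
    hemisphere = packed[-1].upper()
    digits = packed[:-1]
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere in {packed!r}")
    if not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"expected 6 or 7 digits in {packed!r}")

    # The last two digits are seconds, the two before are minutes,
    # and whatever remains at the front is whole degrees.
    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes or seconds out of range in {packed!r}")

    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in {"S", "W"} else value


lat = dms_to_decimal("360914N")
lon = dms_to_decimal("0955933W")
print(round(lat, 4), round(lon, 4))  # 36.1539 -95.9925
```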