CLiMB: Computational Linguistics for Metadata Building
Center for Research on Information Access
Columbia University Libraries
7/27/2016

Problems in Image Access
Libraries have the challenge of cataloging large scholarly collections of images.
Users might search for:
• a specific material (marble)
• a specific subject (cup of wine, satyr)
Traditional approaches use manual expertise:
• slow
• expensive
• often limited in scope

CLiMB Technical Contribution
CLiMB will identify and extract detailed
➢ proper nouns
➢ terms and phrases
from text related to an image:

Messer Iacopo Galli, a Roman gentleman of good understanding, made Michelangelo carve a marble Bacchus, ten palms in height, in his house; this work in form and bearing in every part corresponds to the description of the ancient writers – his aspect, merry; the eyes squinting and lascivious, like those of people excessively given to the love of wine. He holds a cup in his right hand, like one about to drink, and looks at it lovingly, taking pleasure in the liquor of which he was the inventor; for this reason he is crowned with a garland of vine leaves.
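
One naive way to picture proper-noun extraction on a passage like the one above is to collect runs of capitalized words that do not begin a sentence. The sketch below is an illustration only, not the CLiMB implementation; the function name `extract_proper_nouns` and the capitalization heuristic are assumptions made for this example.

```python
import re

# Punctuation that ends a run of capitalized words (assumed for this sketch).
RUN_BREAKERS = ",.;:)"


def extract_proper_nouns(text):
    """Return runs of capitalized words that do not start a sentence.

    Hypothetical heuristic: sentence-initial words are skipped because
    they are capitalized whether or not they are proper nouns.
    """
    candidates = []
    # Crude sentence split on ., !, ? or ; followed by whitespace.
    for sentence in re.split(r"(?<=[.!?;])\s+", text):
        run = []
        for i, word in enumerate(sentence.split()):
            token = word.strip(",.;:()\"'")
            if i > 0 and token[:1].isupper():
                run.append(token)
                # Trailing punctuation closes the current run.
                if word[-1] in RUN_BREAKERS:
                    candidates.append(" ".join(run))
                    run = []
            elif run:
                candidates.append(" ".join(run))
                run = []
        if run:
            candidates.append(" ".join(run))
    return candidates


sample = ("Messer Iacopo Galli, a Roman gentleman of good understanding, "
          "made Michelangelo carve a marble Bacchus, ten palms in height, "
          "in his house.")
print(extract_proper_nouns(sample))
```

Note that the heuristic misses "Messer" because it opens the sentence, which is one reason a later filtering and refinement step is needed.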

CLiMB Outcomes
• Research: Development of richer retrieval through increased numbers of descriptors
• Research and Practice: Creation of enabling technologies for new large digitization projects
• Research and Practice: Expanded capability for cross-collection searching
• Practice: Development of a suite of CLiMB tools
• Resources: Vocabulary list which can be used by other visual resource professionals
The essence of CLiMB:
• Use scholars themselves as “catalogers” by utilizing scholarly publications
• Enhance existing descriptive metadata

CLiMB Progress
• CLiMB Teams
• Tools and Technology Development
• Image Collections
• Evaluation
• Future Plans

CLiMB: Interdisciplinary Research
Funded by Mellon Foundation, 2002-2004
• Center for Research on Information Access
• Computer Science Dept
• Libraries Special Collections
  – Avery Architectural and Fine Arts Library: 4000 images, Greene & Greene
  – Starr East Asian Library: Chinese Paper Gods
  – South Asian Collections: South Asian Temples
• Library Systems Office
• Electronic Text Service
• Libraries Digital Program Division

CLiMB Project Teams
• Coordinating
• Collections (Curatorial)
• Technical
• External Advisory

CLiMB: 2-Year Timetable
YEAR 1
  – Evaluating existing computational tools
  – Developing additional software as needed
  – Selecting and building (scanning, converting) needed candidate texts
  – Loading initial descriptive metadata into end-user system
  – Evaluating initial results with user groups
YEAR 2
  – Use feedback to refine metadata generation & filtering
  – Prepare additional collections for testing
  – Incorporate data in different user platforms
  – Seek external partners for using CLiMB toolset

CLiMB Progress
• CLiMB Teams
• Tools and Technology Development
  1. Find important words, phrases, proper nouns
     • Use existing controlled and uncontrolled vocabularies
     • Filter and refine
  2. Segment long texts to give the user relevant information with the image

Text from Bosley 2000
By September 14, 1908, the basis of the Greenes' final design had been worked out. It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes. Since Charles Pratt had become part-owner of the nearby Foothills Hotel, he and his wife took most of their meals and entertained there. Accordingly, they did not require spacious public rooms for socializing in their new house. Indeed, they reportedly used the house only as "sleeping quarters."

Our Goal (Manual Example)
By September 14, 1908, the basis of the Greenes' final design had been worked out. It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes.
Since Charles Pratt had become part-owner of the nearby Foothills Hotel, he and his wife took most of their meals and entertained there. Accordingly, they did not require spacious public rooms for socializing in their new house. Indeed, they reportedly used the house only as "sleeping quarters."

Existing Metadata
• Project Headings
  – Charles Millard Pratt House (Nordhoff, CA)
• Topical Subject Headings
  – Porches (AAT)
  – Garages (AAT)
• Personal Name Headings
  – Pratt, Charles Millard
• Locality Headings
  – Nordhoff, CA (Avery)
• Material Headings
  – Crayon drawings (AAT)
• Corporate Name Headings
  – George E. Richardson Plumbing (Avery)
• Genre Headings
  – Elevations (AAT)

Existing Metadata Matched in Text
By September 14, 1908, the basis of the Greenes' final design had been worked out. It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes. Since Charles Pratt had become part-owner of the nearby Foothills Hotel, he and his wife took most of their meals and entertained there. Accordingly, they did not require spacious public rooms for socializing in their new house. Indeed, they reportedly used the house only as "sleeping quarters."

Proper Nouns Automatically Extracted from Text
By September 14, 1908, the basis of the Greenes' final design had been worked out.
It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes. Since Charles Pratt had become part-owner of the nearby Foothills Hotel, he and his wife took most of their meals and entertained there. Accordingly, they did not require spacious public rooms for socializing in their new house. Indeed, they reportedly used the house only as "sleeping quarters."

Sample of Automatically Produced Strings (unfiltered)
• 1908, the
• Of the Greenes
• V-shaped plan
• Roofline
• Would be constructed
• Sandstone
• Of the house
• Redwood
(basic collocations without filtering, no noun phrases)

CLiMB: Results of Filtering (sample list)
• Greenes
• V-shaped plan
• Roofline
• Sandstone
• Redwood

Future: Use Authority Files
• Varied sources
  – Art and Architecture Thesaurus (http://www.gii.getty.edu/vocabulary/aat.html)
  – Library of Congress Subject Headings (LCSH)
  – Library of Congress Thesaurus for Graphic Materials (LCTGM)
  – Getty Thesaurus of Geographic Names (http://www.gii.getty.edu/vocabulary/tgn.html)
  – Back-of-the-book indexes
  – Tables of contents
• Incorporate noun phrase chunking
• Find related terms that we may have missed
• Use in conjunction with subject vocabularies
  – Collocation
  – Bootstrapping (using existing lists to help guess unknown terms)

Current CLiMB results
By September 14, 1908, the basis of the Greenes' final design had been worked out.
It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes. Since Charles Pratt had become part-owner of the nearby Foothills Hotel, he and his wife took most of their meals and entertained there. Accordingly, they did not require spacious public rooms for socializing in their new house. Indeed, they reportedly used the house only as "sleeping quarters."

Evaluation of Techniques
How well does the suite of CLiMB tools compare with the human expert?
• Task: Have experts mark up text
• Then, compare:
  – Recall = you found everything that you were supposed to (even though you may also have incorrect results)
  – Precision = everything you found was correct (even if you did not find everything)

Precision and Recall
• Tradeoff between precision and recall
• CLiMB goal: find where our best results will be
• Evaluation with users

CLiMB Progress
• CLiMB Teams
• Tools and Technology Development
  1. Find important words, phrases, proper nouns
     • Use existing controlled and uncontrolled vocabularies
     • Filter and refine
  2. Segment long texts to give the user relevant information with the image

Bringing Images and Text to the User
• A user is probably interested in portions of a document relevant to an image (when permissions allow)
• Segmentation separates a document to home in on subject-specific portions

Segmentation Technique
• Use the frequency with which our terms appear within a document to estimate when the document is about that term
• This graph shows where different names are mentioned in Bosley on Greene & Greene
[Figure: "Project People, Frequency". x-axis: Paragraph (1 to 49); y-axis: Frequency (0 to 12); series: Cole, Bolton, Thorsen, Pratt, Gamble, Blacker, Robinson, Ford]

CLiMB Progress
• CLiMB Teams
• Tools and Technology Development
• Image Collections
• Evaluation
• Future Plans

CLiMB Progress
• CLiMB Teams
• Tools and Technology Development
• Image Collections
  1. Defined criteria for selecting images and text
  2. Identified three collections with varying complexity
  3. Scanned limited material as needed for research
• Evaluation
• Future Plans

Criteria for Choosing Collections
• Sources
  – Collections of images
  – Text about images
• Type of text
  – Tightly associated, e.g. South Asian Temples
  – Loosely associated, e.g. Greene & Greene
  – Somewhere in between, e.g. Chinese Paper Gods
• Rights and Permissions
• Language
  – English

Collections
1. Greene and Greene
  – Architecture, large image collection owned by Avery
  – Bosley, E. Greene & Greene. London: Phaidon Press, Inc., 2000
  – Current, W. Greene & Greene: Architects in the Residential Style. Fort Worth [Tex.]: Amon Carter Museum of Western Art, 1974
  – Smith & Vertikoff. Greene & Greene Masterworks. San Francisco: Chronicle Books, 1998
  – Makinson, R. Greene & Greene: Architecture as a Fine Art. Salt Lake City: Peregrine Smith, 1977
  – Makinson, R. Greene & Greene: The Passion and the Legacy. Salt Lake City: Gibbs Smith, 1998
  – Strand, J. A Greene & Greene Guide. Pasadena, Calif., 1974
2. Chinese Paper Gods
  – Fragile paper with descriptions
  – Goodrich, Anne. Peking Paper Gods: A Look at Home Worship. Nettetal: Steyler Verl., 1991
3. South Asian Temples
  – Images with descriptive text
  – Archaeological Survey of India. Western Circle. Progress report. 1898. Calcutta (Digital South Asia Library http://dsal.uchicago.edu/books/)

Example of Tightly Connected Text
South Asian Temples
• Each image is accompanied by a set of well-defined, consistent descriptions
• Very little extraneous information
• Limited existing metadata
• But: sites are complex and hierarchical

17. Still continuing eastward we found some interesting remains at Velapur. Here by the roadside, just outside the village, is a plain but well preserved old stone temple with a well built dharamsala or rest-house beside it. Around the temple, set up in the ground, and all more or less buried by the additional accumulation of the earth of ages, are about twenty well carved viragals or memorial stones. There are seven in one line which were almost half buried while the rest are scattered about. They represent battle scenes where the hero distinguishes and extinguishes himself, and linga worship. They are a very interesting collection, but are uncared for.… At the side of the steps leading down to a square tank in front is an inscription which records the setting up of a kalasa by Brahmadevarana, a subordinate chief under the king Praudhapratapachakravartin Sri Ramachandradeva in Saka 1227….
Just inside the eastern gateway of the village is a large slab bearing a representation of Gaja-Lakshmi. See photographs 1549, 1550, 1551, 1552, 1553 and 1554.

Example of Loosely Connected Text for Architecture
Bosley, Greene & Greene
• Text describes general architectural trends as well as details about specific objects
• Photographs are of rooms or large sections of the house
• Text contains a wide variety of information
  – Biographical background of clients
  – Details about construction
  – Feedback from clients

Chinese Paper Gods
A print called Wu-lu chih-shen (Gods of the Five Roads) is an 11 x 12" print. There is the usual red panel for the title at the top and the flanking green panels. The only other color is a block of pink in the top center. The picture is entirely filled with figures of five men and their horses. The men all carry swords, wear ordinary clothes and round caps. They are all clean-shaven. The belief in the Wu-lu Gods of Wealth goes back at least to the T’ang Dynasty as figurines of that period labeled Wu-lu have been found.
Goodrich, Anne. Peking Paper Gods: A Look at Home Worship. Nettetal: Steyler Verl., 1991, pp. 95-96.

Current Metadata (selected)
Record Type: Image
Type: Digital Image
Author/Artist: Unknown
Style: Pre-Cultural Revolution
Culture: Chinese
Subject: Deity, Chinese Paper God
Material: Ink, color on paper
Technique: Relief print, woodcut
Length, inches: 13.5
Width, inches: 12
Title: Kuan Yen
Notes on items: On recto, "Nainai not recog. Kuan Yen. To burn."
On verso, "Notes 24"

Research Question: What Counts as “Associated Text” and Is It Useful?
• What does it mean for text to be associated with an image or an object represented by an image?
• Ways to determine relatedness
  – Context
  – Other markup (paragraphs, chapters)
• We are developing ways to identify which paragraphs are relevant to given images
• Key role of proper nouns
• Research question: is low-confidence metadata better than none?
• Let the user inform us…

CLiMB Progress
• CLiMB Teams
• Tools and Technology Development
• Image Collections
• Evaluation
• Future Plans

How do we know success?
1. Does the software find what experts judge we should find?
2. Can people find more images with fewer steps than with current methods?
3. Is this more cost effective than traditional cataloging?

Do CLiMB tools find what experts judge we should find?
• Build test sets with controlled vocabulary by expert catalogers
• Build test sets with uncontrolled vocabulary by art historians
• Test to see how well our software finds what we want it to find

Can people find more images with fewer steps than with current methods?
• Embed results in image search platforms
• Test with users
• Give tasks for which the image is the answer to the question
  – How do people search given only controlled vocabulary?
  – How can they search (and find) given larger vocabulary?
  – What is confusing? What is helpful?
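
The first evaluation question (does the software find what experts judge we should find) can be made concrete by scoring an extracted term list against an expert test set with precision and recall. The sketch below is a minimal illustration, not a CLiMB tool: the function name `precision_recall` is an assumption, and `expert_terms` is a made-up stand-in for real expert markup. The two term lists are the unfiltered and filtered samples from the Bosley passage.

```python
def precision_recall(extracted, expert):
    """Score extracted terms against expert-marked terms, ignoring case."""
    found = {term.lower() for term in extracted}
    wanted = {term.lower() for term in expert}
    correct = found & wanted
    # Precision: share of what we found that the expert also marked.
    precision = len(correct) / len(found) if found else 0.0
    # Recall: share of what the expert marked that we found.
    recall = len(correct) / len(wanted) if wanted else 0.0
    return precision, recall


# Hypothetical expert markup, invented for this example only.
expert_terms = ["Greenes", "V-shaped plan", "roofline",
                "sandstone", "redwood", "porch"]

# Sample strings from the slides, before and after filtering.
unfiltered = ["1908, the", "Of the Greenes", "V-shaped plan", "Roofline",
              "Would be constructed", "Sandstone", "Of the house", "Redwood"]
filtered = ["Greenes", "V-shaped plan", "Roofline", "Sandstone", "Redwood"]

for label, terms in (("unfiltered", unfiltered), ("filtered", filtered)):
    p, r = precision_recall(terms, expert_terms)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

With this invented expert list, the unfiltered strings score precision 0.50 and recall 0.67, and the filtered list scores precision 1.00 and recall 0.83. Real figures would depend entirely on the test sets built by catalogers and art historians.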

Next Steps
• Improve extraction and filtering of terms, phrases, proper nouns
• Make more authoritative standards for evaluation
• Integrate into standard image search platform
  – Start with Luna Insight
• Test initial CLiMB data with users
  – Design experiments to find out how well we are doing
• Improve, improve, improve

Thank you!