CLiMB: Computational Linguistics for Metadata Building

Center for Research on Information Access
Columbia University Libraries
Problems in Image Access


Libraries have the challenge of cataloging large scholarly collections of images.
Users might search for:
– a specific material (marble)
– a specific subject (cup of wine, satyr)
Traditional approaches use manual expertise:
– slow
– expensive
– often limited in scope
7/27/2016
CLiMB Technical Contribution
CLiMB will identify and extract detailed
➢ proper nouns
➢ terms and phrases
from text related to an image:
Messer Iacopo Galli, a Roman gentleman of good
understanding, made Michelangelo carve a marble
Bacchus, ten palms in height, in his house; this work
in form and bearing in every part corresponds to the
description of the ancient writers – his aspect, merry;
the eyes squinting and lascivious, like those of people
excessively given to the love of wine. He holds a cup
in his right hand, like one about to drink, and looks at
it lovingly, taking pleasure in the liquor of which he
was the inventor; for this reason he is crowned with a
garland of vine leaves.
CLiMB Outcomes





Research: Development of richer retrieval through increased
numbers of descriptors
Research and Practice: Creation of enabling technologies for
new large digitization projects
Research and Practice: Expand capability for cross-collection
searching
Practice: Development of suite of CLiMB tools
Resources: Vocabulary list which can be used by other visual
resource professionals
The essence of CLiMB:

Use scholars themselves as “catalogers” by utilizing scholarly
publications

Enhance existing descriptive metadata
CLiMB Progress

CLiMB Teams

Tools and Technology Development

Image Collections

Evaluation

Future Plans
CLiMB: Interdisciplinary Research
Funded by Mellon Foundation 2002-2004






Center for Research on Information Access
Computer Science Dept
Libraries Special Collections
– Avery Architectural and Fine Arts Library – Greene & Greene, 4000 images
– Starr East Asian Library – Chinese Paper Gods
– South Asian Collections – South Asian Temples
Library Systems Office
Electronic Text Service
Libraries Digital Program Division
CLiMB Project Teams
Coordinating
Collections (Curatorial)
Technical
External Advisory
CLiMB:
2-year timetable


YEAR 1
– Evaluating existing computational tools
– Developing additional software as needed
– Selecting and building (scanning, converting) needed
candidate texts
– Loading initial descriptive metadata into end-user system
– Evaluating initial results with user groups
YEAR 2
– Use feedback to refine metadata generation & filtering
– Prepare additional collections for testing
– Incorporate data in different user platforms
– Seek external partners for using CLiMB toolset
CLiMB Progress

CLiMB Teams

Tools and Technology Development
1. Find important words, phrases, proper nouns
• Use existing controlled and uncontrolled vocabularies
• Filter and refine
2. Segment long texts to give the user relevant
information with the image
Text from Bosley 2000
By September 14, 1908, the basis of the
Greenes' final design had been worked out. It
featured a radically informal, V-shaped plan
(that maintained the original angled porch)
and interior volumes of various heights, all
under a constantly changing roofline that
echoed the rise and fall of the mountains
behind it. The chimneys and foundation
would be constructed of the sandstone
boulders that comprised the local geology,
and the exterior of the house would be
sheathed in stained split-redwood shakes.
Since Charles Pratt had become part-owner
of the nearby Foothills Hotel, he and his wife
took most of their meals and entertained
there. Accordingly, they did not require
spacious public rooms for socializing in their
new house. Indeed, they reportedly used the
house only as "sleeping quarters."
Our Goal (Manual Example)
[Same passage from Bosley 2000 as shown above.]
Existing Metadata
• Project Headings
– Charles Millard Pratt House (Nordhoff, CA)
• Topical Subject Headings
– Porches (AAT)
– Garages (AAT)
• Personal Name Headings
– Pratt, Charles Millard
• Locality Headings
– Nordhoff, CA (Avery)
Existing Metadata
• Material Headings
– Crayon drawings (AAT)
• Corporate Name Headings
– George E. Richardson Plumbing (Avery)
• Genre Headings
– Elevations (AAT)
Existing Metadata Matched in Text
[Same passage from Bosley 2000 as shown above.]
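Matching existing headings against the passage can be sketched in a few lines of Python. This is a minimal illustration, not the CLiMB implementation: the function name `match_headings`, the simplified heading list, and the crude plural handling are all assumptions made for the example, and the passage is an excerpt of the Bosley text.

```python
import re

def match_headings(text, headings):
    """Return the headings that occur in the text (case-insensitive).

    Hypothetical helper: it also tries naive singular forms, so that a
    heading such as "Porches" can match "porch" in running text.
    """
    lowered = text.lower()
    found = []
    for heading in headings:
        term = heading.lower()
        candidates = {term}
        if term.endswith("es"):
            candidates.add(term[:-2])   # "porches" -> "porch"
        if term.endswith("s"):
            candidates.add(term[:-1])   # "garages" -> "garage"
        # Whole-word match only, so "porch" does not match inside a longer word.
        if any(re.search(r"\b" + re.escape(c) + r"\b", lowered) for c in candidates):
            found.append(heading)
    return found

# Excerpt of the Bosley passage; the headings are simplified from the
# existing metadata (for example "Pratt" instead of "Pratt, Charles Millard").
passage = (
    "It featured a radically informal, V-shaped plan (that maintained the "
    "original angled porch) and interior volumes of various heights. "
    "Since Charles Pratt had become part-owner of the nearby Foothills Hotel, "
    "he and his wife took most of their meals and entertained there."
)
headings = ["Porches", "Garages", "Pratt", "Nordhoff"]

print(match_headings(passage, headings))  # ['Porches', 'Pratt']
```

Only two of the four headings appear in this excerpt, which is the gap that text-derived terms are meant to fill.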
Proper Nouns
Automatically Extracted from Text
[Same passage from Bosley 2000 as shown above.]
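One simple way to approximate proper noun extraction is to collect runs of capitalized words that do not begin a sentence. The sketch below shows that heuristic only; the slides do not say which extractor CLiMB uses, and the function name `capitalized_runs` is made up for this example.

```python
import re

def capitalized_runs(text):
    """Collect runs of capitalized words, skipping sentence-initial words.

    A rough heuristic (an assumption, not the CLiMB method): the first
    word of each sentence is ignored because it is always capitalized.
    """
    runs = []
    # Split into sentences at whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        current = []
        for position, word in enumerate(sentence.split()):
            token = word.strip(".,;:()\"'")
            if position > 0 and token[:1].isupper():
                current.append(token)
            elif current:
                runs.append(" ".join(current))
                current = []
        if current:
            runs.append(" ".join(current))
    return runs

passage = (
    "Since Charles Pratt had become part-owner of the nearby Foothills Hotel, "
    "he and his wife took most of their meals and entertained there. "
    "By September 14, 1908, the basis of the Greenes' final design had been "
    "worked out."
)

print(capitalized_runs(passage))
# ['Charles Pratt', 'Foothills Hotel', 'September', 'Greenes']
```

The heuristic misses proper nouns that start a sentence and would keep any capitalized common word, so its output still needs filtering.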
Sample of Automatically Produced
Strings (unfiltered)

1908, the

Of the Greenes

V-shaped plan

Roofline

Would be constructed

Sandstone

Of the house

Redwood
(basic collocations without filtering, no noun
phrases)
CLiMB: Results of Filtering
(sample list)

Greenes
V-shaped plan
Roofline
Sandstone
Redwood
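A first filtering pass over the unfiltered strings could trim stop words and bare numbers from the edges of each candidate. The sketch below assumes a small, hand-picked `STOP_WORDS` set and hypothetical helper names; it is not the CLiMB filter, and it does not fully reproduce the sample list above.

```python
# Hypothetical stop-word list, chosen only to cover the sample strings.
STOP_WORDS = {"of", "the", "would", "be", "a", "an", "and", "in"}

def is_noise(token):
    """True for stop words and bare numbers such as '1908'."""
    return token.lower() in STOP_WORDS or token.isdigit()

def trim_candidate(candidate):
    """Strip punctuation, then drop noise tokens from both ends."""
    tokens = [t.strip(",.;:") for t in candidate.split()]
    tokens = [t for t in tokens if t]
    while tokens and is_noise(tokens[0]):
        tokens.pop(0)
    while tokens and is_noise(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)

def filter_candidates(candidates):
    """Return trimmed candidates, without empties or repeats."""
    kept = []
    for candidate in candidates:
        trimmed = trim_candidate(candidate)
        if trimmed and trimmed not in kept:
            kept.append(trimmed)
    return kept

raw_strings = [
    "1908, the", "Of the Greenes", "V-shaped plan", "Roofline",
    "Would be constructed", "Sandstone", "Of the house", "Redwood",
]

print(filter_candidates(raw_strings))
# ['Greenes', 'V-shaped plan', 'Roofline', 'constructed', 'Sandstone', 'house', 'Redwood']
```

This pass still keeps "constructed" and "house", which the sample list omits. Removing them would take more information, such as part-of-speech tags or a domain-specific list of generic words.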
Future: Use Authority Files

Varied sources
– Art and Architecture Thesaurus (http://www.gii.getty.edu/vocabulary/aat.html)
– Library of Congress Subject Headings (LCSH)
– Library of Congress Thesaurus for Graphic Materials (LCTGM)
– Getty Thesaurus of Geographic Names (http://www.gii.getty.edu/vocabulary/tgn.html)
– Back-of-the-book indexes
– Tables of contents

Incorporate noun phrase chunking

Find related terms that we may have missed

Use in conjunction with Subject vocabularies
– Collocation
– Bootstrapping (using existing lists to help guess unknown terms)
Current CLiMB results
[Same passage from Bosley 2000 as shown above.]
Evaluation of Techniques

How well does the suite of CLiMB tools compare with the
human expert?
Task: Have experts mark up text
Then, compare:


Recall = the share of the expert-marked terms that the tools found (high recall is possible even when the results also include incorrect terms)
Precision = the share of the terms the tools found that are correct (high precision is possible even when some terms are missed)
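The two measures can be computed directly from a set of expert-marked terms and a set of tool-extracted terms. The sketch below uses made-up term sets purely for illustration; they are not CLiMB evaluation data, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(found, expected):
    """Return (precision, recall) for extracted terms against an expert list."""
    found, expected = set(found), set(expected)
    correct = found & expected
    # Guard against empty sets to avoid division by zero.
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Illustrative data only: six terms marked by an expert, four returned by a tool.
expert_terms = {"Greenes", "V-shaped plan", "roofline",
                "sandstone", "redwood", "Charles Pratt"}
tool_terms = {"Greenes", "sandstone", "redwood", "house"}

precision, recall = precision_recall(tool_terms, expert_terms)
print(precision, recall)  # 0.75 0.5
```

Here three of the four extracted terms are correct (precision 0.75), but only three of the six expert terms were found (recall 0.5).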
Precision
Recall
Tradeoff between precision and recall.
CLiMB goal: find the point on this tradeoff where our results are best.
Evaluation with users.
CLiMB Progress

CLiMB Teams

Tools and Technology Development
1. Find important words, phrases, proper nouns
• Use existing controlled and uncontrolled vocabularies
• Filter and refine
2. Segment long texts to give the user relevant
information with the image
Bringing Images and Text to the User

A user is probably interested in portions of a
document relevant to an image (when permissions
allow)

Segmentation divides a document so that we can home in on
subject-specific portions
Segmentation Technique
[Figure: "Project People, Frequency". Chart of how often each name (Cole, Bolton, Thorsen, Pratt, Gamble, Blacker, Robinson, Ford) is mentioned in each paragraph. x-axis: Paragraph (1 to 49); y-axis: Frequency (0 to 12).]
Use the frequency with which our terms appear within a document to estimate when the document is about that term.
This graph shows where different names are mentioned in Bosley on Greene & Greene.
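The frequency idea can be sketched as a count of each name per paragraph, followed by a threshold that marks the paragraphs likely to be about that name. Everything specific here is an assumption for illustration: the helper names, the threshold of 2, and the toy text, which is invented and not taken from Bosley.

```python
def name_counts_by_paragraph(document, names):
    """Return one {name: count} dict per non-empty paragraph."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return [{name: p.count(name) for name in names} for p in paragraphs]

def paragraphs_about(table, name, threshold=2):
    """1-based paragraph numbers where the name appears at least `threshold` times."""
    return [i + 1 for i, counts in enumerate(table) if counts[name] >= threshold]

# Invented toy text with blank lines between paragraphs.
document = (
    "Pratt wanted a small house. Pratt liked porches.\n\n"
    "Gamble wanted a large house. Gamble and Pratt were neighbors.\n\n"
    "Gamble liked redwood. Gamble moved in."
)
names = ["Pratt", "Gamble"]

table = name_counts_by_paragraph(document, names)
print(paragraphs_about(table, "Pratt"))   # [1]
print(paragraphs_about(table, "Gamble"))  # [2, 3]
```

A run of consecutive paragraphs above the threshold suggests a segment of the document that is about that person, which is the kind of portion a user would want to see next to an image.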
CLiMB Progress

CLiMB Teams

Tools and Technology Development

Image Collections
1. Defined criteria for selecting images and text
2. Identified three collections with varying complexity
3. Scanned limited material as needed for research

Evaluation

Future Plans
Criteria for Choosing Collections

Sources
– Collections of images
– Text about images

Type of text
– Tightly associated – e.g. South Asian Temples
– Loosely associated – e.g. Greene & Greene
– Somewhere in between – e.g. Chinese Paper Gods

Rights and Permissions

Language – English
Collections
1. Greene and Greene – Architecture, large image collection owned by Avery
– Bosley, E. Greene & Greene. London: Phaidon Press, Inc., 2000
– Current, W. Greene & Greene: Architects in the Residential Style. Fort Worth [Tex.]: Amon Carter Museum of Western Art, 1974
– Smith & Vertikoff. Greene & Greene Masterworks. San Francisco: Chronicle Books, 1998
– Makinson, R. Greene & Greene: Architecture as a Fine Art. Salt Lake City: Peregrine Smith, 1977
– Makinson, R. Greene & Greene: The Passion and the Legacy. Salt Lake City: Gibbs Smith, 1998
– Strand, J. A Greene & Greene Guide. Pasadena, Calif., 1974
2. Chinese Paper Gods – Fragile paper with descriptions
– Goodrich, Anne. Peking Paper Gods: A Look at Home Worship. Nettetal: Steyler Verl., 1991
3. South Asian Temples – Images with descriptive text
– Archaeological Survey of India. Western Circle. Progress report. 1898. Calcutta (Digital South Asia Library http://dsal.uchicago.edu/books/)
Example of Tightly Connected Text
South Asian Temples

Each image is accompanied by a set of well-defined, consistent descriptions

Very little extraneous information

Limited existing metadata

But: sites are complex and hierarchical
17. Still continuing eastward we found some interesting remains
at Velapur. Here by the roadside, just outside the village, is a plain
but well preserved old stone temple with a well built dharamsala
or rest-house beside it. Around the temple, set up in the ground,
and all more or less buried by the additional
accumulation of the earth of ages, are about twenty well carved
viragals or memorial stones. There are seven in one line which
were almost half buried while the rest are scattered about. They
represent battle scenes where the hero distinguishes and
extinguishes himself, and linga worship. They are a very
interesting collection, but are uncared for.… At the side of the
steps leading down to a square tank in front is an inscription
which records the setting up of a kalasa by Brahmadevarana, a
subordinate chief under the king Praudhapratapachakravartin Sri
Ramachandradeva in Saka 1227…. Just inside the eastern
gateway of the village is a large slab bearing a representation of
Gaja-Lakshmi.
See photographs 1549, 1550, 1551, 1552, 1553 and 1554.
Example of Loosely Connected Text
for Architecture

Bosley, Greene & Greene

Text describes general architectural trends as well
as details about specific objects

Photographs are of rooms or large sections of the
house

Text contains a wide variety of information
– Biographical background of clients
– Details about construction
– Feedback from clients
Chinese Paper Gods
A print called Wu-lu chih-shen (Gods of
the Five Roads) is an 11 x 12” print.
There is the usual red panel for the title at
the top and the flanking green panels.
The only other color is a block of pink in
the top center. The picture is entirely
filled with figures of five men and their
horses. The men all carry swords, wear
ordinary clothes and round caps. They
are all clean-shaven. The belief in the
Wu-lu Gods of Wealth goes back at least
to the T’ang Dynasty as figurines of that
period labeled Wu-lu have been found.
Goodrich, Anne. Peking Paper Gods: A
Look at Home Worship. Nettetal: Steyler Verl., 1991, pp. 95-96.
Current Metadata (selected)











Record Type: Image
Type: Digital Image
Author/Artist: Unknown
Style: Pre-Cultural Revolution
Culture: Chinese
Subject: Deity, Chinese Paper God
Material: Ink, color on paper
Technique: Relief print, woodcut
Length, inches: 13.5
Width, inches: 12
Title: Kuan Yen
Notes on items: On recto, "Nainai not recog. Kuan Yen. To burn." On verso, "Notes 24"
Research Question: What counts as
“Associated Text” and Is It Useful?






What does it mean for text to be associated with an image or
an object represented by an image?
Ways to determine relatedness
– Context
– Other markup (paragraphs, chapters)
We are developing ways to identify which paragraphs are
relevant to given images
Key role of proper nouns
Research question: is low confidence metadata better than
none?
Let the user inform us…
CLiMB Progress

CLiMB Teams

Tools and Technology Development

Image Collections

Evaluation

Future Plans
How do we know success?
1. Does the software find what experts judge we
should find?
2. Can people find more images with fewer steps than
with current methods?
3. Is this more cost effective than traditional
cataloging?
Do CLiMB tools find what experts
judge we should find?

Build test sets with controlled vocabulary by expert
catalogers

Build test sets with uncontrolled vocabulary by art
historians

Test to see how well our software finds what we
want it to find
Can people find more images with
fewer steps than with current
methods?

Embed results in image search platforms

Test with users

Give tasks for which the image is the answer to the
question
– How do people search given only controlled vocabulary?
– How can they search (and find) given larger vocabulary?
– What is confusing? What is helpful?
Next Steps

Improve extraction and filtering of terms, phrases, proper
nouns

Make more authoritative standards for evaluation

Integrate into standard image search platform
– Start with Luna Insight

Test initial CLiMB data with users
– Design experiments to find out how well we are doing

Improve

Improve

Improve
Thank you!