CLiMB:
Computational Linguistics for
Metadata Building
Center for Research on Information Access
Columbia University Libraries
Judith L. Klavans
Goals of Meeting
• Review progress since June 2003 meeting
– Advisory Board suggestions
– Select a new collection with narrow criteria
– Test results outside of image access platform
• Strategize for Next Steps
– Potential partners
– Driving questions
– Selection of project direction(s)
June 2003 to November 2003
Four areas
• Collections
• Technology
• Users and Uses
• Interface Tools
Problems in Image Access


• Cataloging digital images
• Traditional approach: manual expertise
– labor intensive
– expensive
• Can automated techniques help?
CLiMB Technical Contribution
CLiMB will identify and extract
• proper nouns
• terms and phrases
from text related to an image:
September 14, 1908, the basis of the Greenes' final
design had been worked out. It featured a radically
informal, V-shaped plan (that maintained the original
angled porch) and interior volumes of various heights,
all under a constantly changing roofline that echoed
the rise and fall of the mountains behind it. The
chimneys and foundation would be constructed of the
sandstone boulders that comprised the local geology,
and the exterior of the house would be sheathed in
stained split-redwood shakes. —Edward R. Bosley.
Greene & Greene. London : Phaidon, 2000. p. 127
Can we harvest image descriptors?
Progress and Planning
• Collections
• Technology
• Users and Uses
• Interface Tools
CLiMB Collections
• Greene & Greene Architectural Drawings
– Complex images
– Scholarly texts written about the projects
– Loose association between text and image
– Columbia owns many images
• Chinese Paper Gods
– Less complex image
– Lay description of each image
– Small, valuable collection scanned for CLiMB
– Multilingual transcription is non-standard and variable
Greene & Greene Architectural Records and
Papers Collection
Drawings and Archives
Avery Architectural and Fine Arts Library
Columbia University Libraries
Chinese Paper Gods
Anne S. Goodrich Collection
C.V. Starr East Asian Library,
Columbia University
Pan-hu chih-shen
God of tigers
New Collection: Desiderata
• Close association between text and image
• Scholarly descriptions well-structured for
testing NLP tools
• Clear Target Object Identifiers (TOIs)
• English only
• Intellectual Property Rights
Potential Choice
North Carolina Museum of Art :
Handbook of the Collections
Introduction, Lawrence J. Wheeler ; editor,
Rebecca Martin Nagy ; assisted by June
Spence ; contributors, Virgina Burden ... [et
al.]. Raleigh : The Museum ; New York, NY :
Distributed by Hudson Hills Press, 1998.
About the Collection
• Available through Saskia
• 70 images
• Good quality images and details
• Well-structured delimited text descriptions
• Rights management still needs to be addressed
Alex Katz
American, born 1927
Six Women, 1975
Oil on canvas
114 x 282 in.
Alex Katz has developed a remarkable hybrid art that combines the
aggressive scale and grandeur of modern abstract painting with a chic,
impersonal realism. During the 1950s and 1960s—decades dominated by
various modes of abstraction—Katz stubbornly upheld the validity of figurative
painting. In major, mature works such as Six Women, the artist distances himself
from his subject. Space is flattened, as are the personalities of the women, their
features simplified and idealized: Katz’s models are as fetching and vacuous as
cover girls. The artist paints them with the authority and license of a master
craftsman, but his brush conveys little emotion or personality. In contrast to the
turbulent paint effects favored by the abstract expressionist artists, Katz pacifies
the surface of his picture. Through the virtuosic technique of painting wet-on-wet, he achieves a level and unifying smoothness. He further “cools” the image
by adopting the casually cropped composition and overpowering size and
indifference of a highway billboard or big-screen movie.
In Six Women, Katz portrays a gathering of young friends at his Soho loft.
The apparent informality of the scene is deceptive. It is, in fact, carefully staged.
Note the three pairs of figures: the foreground couple face each other; the
middle ground pair alternately look out and into the picture; and the pair in the
background stand at matching oblique angles. The artist also arranges the
women into two conversational triangles. Katz studied each model separately,
then artfully fit the models into the picture. The image suggests an actual event,
but the only true event is the play of light. From the open windows, a cordial
afternoon sunlight saturates the space, accenting the features of each woman.
http://ncartmuseum.org/collections/offviewcaptions.shtml#alex
Frank Philip Stella
American, born 1936
Raqqa II, 1970
Synthetic polymer and graphite on canvas
120 x 300 in.
To many artists of Frank Stella’s generation, the highly subjective paintings
of the abstract expressionists seemed mannered and self-indulgent. Stella’s
response was to systematize the abstract picture using geometry and a strict but
arbitrary set of procedures. Explaining that his art “is based on the fact that only
what can be seen there is there,” he sought to distill the image to paint and canvas
alone. He stripped his paintings of story or statement—even a brushstroke
conveyed too much personality. Stella methodically developed images in series,
first mapping the designs on paper before transferring them to canvas. Little was
left to chance. Raqqa II belongs to Stella’s aptly titled Protractor Series, begun in
1967. Though never completed, the series was to include 31 compositions, each to
be carried out in three different formats: interlaces, rainbows and fans. He titled the
paintings after ancient, circular-planned cities.
Raqqa II does not lie quietly on the wall. It dominates its surroundings. What
at first glance appears a childlike pattern is actually a highly complex exercise in
perception. Bright bands of flat color arc and overlap, promising an illusion of
receding space. However, their containment within a strict system of seven shaped
and framed units confounds that illusion. The monumental scale and aggressive
confidence of Raqqa II typify American art during the 1960s.
http://ncartmuseum.org/collections/offviewcaptions.shtml#frank
Progress and Planning
• Collections
• Technology
• Users and Uses
• Interface Tools
Text Analysis and Filtering
1. Divide text into words and phrases
2. Gather features for each word and phrase
• E.g. Is it in the AAT? Is it very frequent?
3. Develop formulae using this information
4. Use formulae to rank for usefulness as
potential metadata
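The four steps above can be sketched roughly as follows. This is a minimal illustration, not the CLiMB implementation: the tiny THESAURUS set stands in for a real AAT lookup, the candidate phrases are limited to single words and adjacent word pairs, and the weights in score() are arbitrary assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for a controlled vocabulary such as the AAT;
# these entries are illustrative, not an actual AAT lookup.
THESAURUS = {"porch", "chimneys", "foundation", "shakes"}

def candidates(text):
    """Step 1: split text into single words and adjacent two-word phrases."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def gather_features(text):
    """Step 2: record simple features for every candidate."""
    counts = Counter(candidates(text))
    return {
        term: {"frequency": n, "in_thesaurus": term in THESAURUS}
        for term, n in counts.items()
    }

def score(features):
    """Step 3: a toy formula; the weights are arbitrary assumptions."""
    return features["frequency"] + (5 if features["in_thesaurus"] else 0)

def rank(text):
    """Step 4: order candidates from most to least promising."""
    feats = gather_features(text)
    return sorted(feats, key=lambda term: (-score(feats[term]), term))
```

Calling rank() on a passage returns every candidate word and phrase, with thesaurus matches and repeated terms near the top.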
What Features do we Track?
• Lexical features
– Proper noun, common noun
• Relevancy to domain
– Target Object Identifier (TOI)
– Presence in the Art & Architecture Thesaurus
– Presence in the back-of-book index
• Statistical observations
– Frequency in the text
– Frequency across a larger set of texts, within and
outside the domain
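One way to turn the two frequency observations into a single feature is to compare how often a term occurs in domain texts with how often it occurs in general texts. The sketch below is an assumption about how such a comparison could be done, using a log ratio with add-one smoothing; the function name and the smoothing choice are illustrative, not taken from CLiMB.

```python
import math
from collections import Counter

def domain_specificity(term, domain_tokens, background_tokens):
    """Log ratio of smoothed relative frequencies.

    Positive values mean the term is relatively more common in the
    domain texts than in the background texts; negative values mean
    the opposite. Add-one smoothing avoids division by zero for
    terms that are missing from one of the two collections.
    """
    domain_counts = Counter(domain_tokens)
    background_counts = Counter(background_tokens)
    p_domain = (domain_counts[term] + 1) / (len(domain_tokens) + 1)
    p_background = (background_counts[term] + 1) / (len(background_tokens) + 1)
    return math.log(p_domain / p_background)
```

A term such as "pergola" that appears in architectural texts but rarely elsewhere would get a positive value, while a function word such as "the" would sit near or below zero.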
Problem: Too much Data!
• How should the output be filtered?
• What filtering helps additional text
processing (e.g. for text segmentation)?
• What filtering matches what users think?
Techniques for Filtering
1. Take an initial guess
• Collect input from users
• Alter formulae based on feedback
2. Use automatic techniques to guess (machine learning)
• Collect input from users
• Run programs to make predictions based on given opinions (Bayesian networks, classifiers, decision trees)
3. The CLiMB approach: Use both techniques!
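As an illustration of the second technique, the sketch below trains a small naive Bayes classifier on user judgments. It is only one of the classifier families mentioned above and is not meant to represent the CLiMB code; the feature names and the boolean "useful" label are assumptions made for the example.

```python
import math
from collections import defaultdict

def train_naive_bayes(examples):
    """Train on (features, label) pairs.

    features: dict mapping a feature name to True/False.
    label: True if users judged the term useful, else False.
    """
    label_counts = defaultdict(int)
    # label -> feature name -> number of examples where the feature is True
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, label in examples:
        label_counts[label] += 1
        for name, value in features.items():
            if value:
                feature_counts[label][name] += 1
    return label_counts, feature_counts

def predict_useful(model, features):
    """Return the more probable label for a new term."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n in label_counts.items():
        logp = math.log(n / total)
        for name, value in features.items():
            # Laplace smoothing for a binary feature
            p_true = (feature_counts[label][name] + 1) / (n + 2)
            logp += math.log(p_true if value else 1 - p_true)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

In a combined approach, the manually set formula could supply the initial ranking shown to users, and a classifier like this one could then be trained on their responses.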
Initial Manual Filter
• Increase score if proper noun;
• Decrease score if very frequent in Brown
corpus;
• Increase score if frequent in back-of-book
indexes;
• Increase score if particularly frequent in
domain specific texts;
• Increase score if present in authority lists
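The rules above give only the direction of each adjustment, not its size. A minimal sketch, assuming every rule moves the score by one point and that each feature has already been reduced to a yes/no value (both are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class TermFeatures:
    """Yes/no observations about one candidate term (hypothetical names)."""
    is_proper_noun: bool = False
    frequent_in_brown: bool = False    # very frequent in the Brown corpus
    frequent_in_index: bool = False    # frequent in back-of-book indexes
    frequent_in_domain: bool = False   # particularly frequent in domain texts
    in_authority_list: bool = False

def manual_score(f: TermFeatures) -> int:
    """Apply the five rules; the unit weights are an assumption."""
    score = 0
    if f.is_proper_noun:
        score += 1
    if f.frequent_in_brown:
        score -= 1
    if f.frequent_in_index:
        score += 1
    if f.frequent_in_domain:
        score += 1
    if f.in_authority_list:
        score += 1
    return score
```

For example, a proper noun found in an authority list would score 2, while a common word that is frequent in the Brown corpus and nothing else would score -1.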
Early Results
Cordelia Culbertson
Greene
James Culbertson
James A. Culbertson
house
special furnishings Charles
Cordelia A. Culbertson house
Blacker house
Tichenor house
bedrooms
Greene furniture
Pacific Coast Architect
Culbertson residence
single-story elevation
Next Steps
• Filter “given” information (already in
catalogue record if you are lucky enough to
have one!)
• What does CLiMB get that is new?
• How much is useful?
• What is the “cost”?
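Filtering out "given" information amounts to removing extracted terms that already appear in the catalogue record. A minimal sketch, assuming both sides are plain lists of strings and that a case-insensitive exact match is good enough (real records would likely need variant-name matching):

```python
def new_terms(climb_terms, catalog_record_terms):
    """Return extracted terms that are not already in the catalogue record.

    Matching is case-insensitive and ignores surrounding whitespace;
    the original order of climb_terms is preserved.
    """
    given = {t.casefold().strip() for t in catalog_record_terms}
    return [t for t in climb_terms if t.casefold().strip() not in given]

def fraction_new(climb_terms, catalog_record_terms):
    """Share of extracted terms that add something to the record."""
    if not climb_terms:
        return 0.0
    return len(new_terms(climb_terms, catalog_record_terms)) / len(climb_terms)
```

The fraction gives a first, rough answer to "What does CLiMB get that is new?"; judging how much of it is useful still needs human review.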
Segmentation
• Determination of relevant segment
• Difficult for Greene & Greene
– The exact text related to a given image is difficult to
determine
– Use of TOI to find this text
• Easy for Chinese Paper Gods and for the next collection
• Decision: set initial values manually and explore
automatic techniques
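A simple manual baseline for using a TOI to find the related text is to keep only the paragraphs that mention it. The sketch below assumes paragraphs are separated by blank lines and that the TOI variants are supplied by hand; both are assumptions for illustration.

```python
def segments_for_toi(text, toi_variants):
    """Return the paragraphs that mention any variant of a TOI.

    text: the full text, with paragraphs separated by blank lines.
    toi_variants: names by which the target object may be referred to.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    variants = [v.casefold() for v in toi_variants]
    return [
        p for p in paragraphs
        if any(v in p.casefold() for v in variants)
    ]
```

Exact substring matching will miss pronouns and indirect references, which is part of why this step is hard for loosely associated texts.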
Progress and Planning
• Collections
• Technology
• Users and Uses
• Interface Tools
Formative Evaluation Meeting
• At the advice of External Advisory Board
• October 17, 2003
• Goals:
– Get early feedback from many user types
– Incorporate that feedback into CLiMB toolset
– Help shape next steps
Formative Evaluation - Attendees
• CLiMB Project Team
– Judith Klavans
– Roberta Blitz
– Rebecca Passonneau
– Angela Giral
– Vera Horvath
– David Elson
– Bob Wolven
– Stephen Davis
– Mark Weber
• CLiMB: External Advisory Board
– Jeff Cohen (Bryn Mawr)
– Carl Lagoze (Cornell)
– Merrilee Proffitt (RLG)
• Invitees
– Robert Carlucci (Columbia)
– Terry Catapano (Columbia)
– Paula Gabbard (Columbia)
– Deborah Kempe (Frick)
– Doug Oard (UMd)
• Could not Attend
– Tony Gill (Mellon)
– Abby Goodrum (Syracuse)
– Elisa Lanzi (Smith)
Research Questions
• Will CLiMB metadata help users get access
to the digital images they want?
• Will these terms help catalogers provide this
access?
• How well are the CLiMB tools performing
in providing required metadata?
Formative Evaluation
Agenda:
http://www.columbia.edu/cu/cria/climb/meeting.html
Surveys:
http://www1.cs.columbia.edu/~delson/survey/gg-index.html
http://www1.cs.columbia.edu/~delson/survey/cpg-index.html
What phrases do people select?
ridge beams
gunite
Cordelia A. Culbertson house
Ludowici-Celadon Company
Cordelia Culbertson
extensive water gardens
nontimber materials
pergola
'U plan
enclosed court
James Culbertson
James A. Culbertson
single-story elevation
two-story height
Pasadena's Oak Knoll neighborhood
roof over-hangs
Results from Formative Evaluation
• Best – Humans select, CLiMB selects
– Cordelia A. Culbertson
• Better – Humans select, CLiMB might not
– Ludowici-Celadon Company
• Better – Humans might not but CLiMB selects
– house, Tichenor house, most significant house
• Good – Humans do not select, CLiMB does not
– problem, time
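The four outcomes above can be tallied by comparing the set of terms people selected with the set CLiMB selected. This sketch treats selection as a strict yes/no on both sides, which is a simplification of the "might not" cases; the category labels are illustrative.

```python
from collections import Counter

def agreement_category(human_selected, climb_selected):
    """Name the cell of the two-by-two comparison a term falls into."""
    if human_selected and climb_selected:
        return "both"
    if human_selected:
        return "human only"
    if climb_selected:
        return "CLiMB only"
    return "neither"

def tabulate(terms, human_terms, climb_terms):
    """Count how many candidate terms fall into each category."""
    human, climb = set(human_terms), set(climb_terms)
    return Counter(agreement_category(t in human, t in climb) for t in terms)
```

The "human only" and "CLiMB only" counts point to where the ranking formula most needs adjustment.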
Use Results for Improvement
• Determine ways to better filter CLiMB
results
• Use input for improving ranking
Use Results for Improvement
1. Use initial ranking to collect feedback
2. Compare CLiMB with user survey ranking
3. Analyze performance and study the errors
4. Refine formula
5. Repeat
Beware: Danger of tailoring to test texts
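One common way to compare two rankings of the same terms is Spearman's rank correlation. Whether CLiMB used this measure is not stated here, so the sketch below is an assumption; it also assumes that neither ranking contains ties.

```python
def spearman(order_a, order_b):
    """Spearman rank correlation between two orderings of the same terms.

    order_a, order_b: lists of the same terms, best first, no duplicates.
    Returns 1.0 for identical orderings and -1.0 for reversed ones.
    """
    if len(order_a) != len(set(order_a)) or len(order_b) != len(set(order_b)):
        raise ValueError("rankings must not contain duplicate terms")
    if set(order_a) != set(order_b):
        raise ValueError("rankings must cover the same terms")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two terms to compare")
    position_b = {term: i for i, term in enumerate(order_b)}
    d_squared = sum((i - position_b[term]) ** 2 for i, term in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

To guard against tailoring the formula to the test texts, the correlation could be tracked on texts held back from the refinement loop as well as on those used in it.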
Raw Results
• Raw survey results are at
www.cs.columbia.edu/~delson/CLiMB/checklist-results.xls
• Survey results joined with CLiMB ranks, sorted by CLiMB
score: www.cs.columbia.edu/~delson/CLiMB/gg-joined-resultsby-rank.xls
• Survey results joined with CLiMB ranks, sorted by human
score: www.cs.columbia.edu/~delson/CLiMB/gg-joined-resultsby-survey.xls
• Quantized survey results (High/Medium/Low):
www.cs.columbia.edu/~delson/CLiMB/gg-quantized-results.xls
Progress and Planning
• Collections
• Technology
• Users and Uses
• Interface Tools
Interface Tools
• Planning the new interface for image professionals
to prepare CLiMB metadata from texts
• For catalogers / metadata specialists and visual
resources professionals
• Goals
– to provide a platform for a wider community
– to be able to collect feedback on CLiMB at a wider
level
– to complete the CLiMB interface “deliverable”
Interface Tools – Stay Tuned!
• CLiMB toolset currently implemented with textual
interface
– Fully-functional shell
• New graphical user interface (GUI) can be built on
top of existing codebase
– Perl/Tk
• Design
– Initiating design phase now
– Consulting metadata and image specialists
Next Steps
• External Advisory Board – June 2004
• Select project directions
• Potential partners
Thank you!
www.columbia.edu/cu/cria