CS 430 / INFO 430
Information Retrieval
Lecture 17
Metadata 4
Course Administration
Automated Creation of Metadata Records
Sometimes it is possible to generate metadata automatically from
the content of a digital object. The effectiveness varies from field
to field.
Examples
• Images -- characteristics of color, texture, shape, etc. (crude)
• Music -- optical recognition of score (good)
• Bird song -- spectral analysis of sounds (good)
• Fingerprints (good)
Automated Information Retrieval Using Feature Extraction
Example: features extracted from images
• Spectral features: color or tone, gradient, spectral parameters, etc.
• Geometric features: edge, shape, size, etc.
• Textural features: pattern, spatial frequency, homogeneity, etc.
Features can be recorded in a feature vector space (as in a term
vector space). A query can be expressed in terms of the same
features.
Machine learning methods, such as a support vector machine, can
be used with training data to create a similarity metric between
image and query.
Example: Searching satellite photographs for dams in California
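To make the idea concrete, here is a minimal sketch of retrieval in a feature vector space, assuming each image has already been reduced to a small vector of spectral, geometric, and textural features. It ranks images by cosine similarity to a query expressed in the same features; a trained similarity metric (e.g., from a support vector machine) could replace the cosine measure. All names and feature values below are invented for illustration.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors: [mean hue, edge density,
# spatial frequency, homogeneity], one per image.
images = {
    "photo_001": np.array([0.21, 0.55, 0.30, 0.80]),
    "photo_002": np.array([0.75, 0.12, 0.66, 0.40]),
}

# The query is expressed in the same feature space.
query = np.array([0.70, 0.15, 0.60, 0.45])

# Rank images by similarity to the query, most similar first.
for name, vec in sorted(images.items(),
                        key=lambda kv: cosine_similarity(kv[1], query),
                        reverse=True):
    print(name, round(cosine_similarity(vec, query), 3))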
Example: Blobworld
[Blobworld screenshots not reproduced.]
Automatic Creation of Surrogates for Non-textual Materials
Discovery of non-textual materials usually requires surrogates
• How far can these surrogates be created automatically?
• Automatically created surrogates are much less expensive than
manually created ones, but have high error rates.
• If surrogates have high rates of error, is it possible to have
effective information discovery?
Example: Informedia Digital Video Library
Collections: Segments of video programs, e.g., TV and radio
news and documentary broadcasts: Cable News Network (CNN),
British Open University, WQED television.
Segmentation: Automatically broken into short segments of
video, such as the individual items in a news broadcast.
Size: More than 4,000 hours, 2 terabytes.
Objective: Research into automatic methods for organizing and
retrieving information from video.
Funding: NSF, DARPA, NASA and others.
Principal investigator: Howard Wactlar (Carnegie Mellon
University).
Informedia Digital Video Library: History
• Carnegie Mellon has broad research programs in speech
recognition, image recognition, and natural language processing.
• 1994. Basic mock-up demonstrated the general concept of a
system using speech recognition to build an index from a sound
track matched against spoken queries. (DARPA funded.)
• 1994-1998. Informedia developed the concept of multi-modal
information discovery with a series of user interface
experiments. (NSF/DARPA/NASA Digital Libraries Initiative.)
• 1998 to date. Continued research, particularly in human-computer
interaction. Commercial spin-off failed.
The Challenge
A video sequence is awkward for information discovery:
• Textual methods of information retrieval cannot be applied
• Browsing requires the user to view the sequence. Fast skimming
is difficult.
• Computing requirements are demanding (MPEG-1 requires 1.2
Mbits/sec).
Surrogates are required.
Multi-Modal Information Discovery
The multi-modal approach to information retrieval uses computer
programs to analyze video materials for clues, e.g., changes of scene:
• methods from artificial intelligence, e.g., speech
recognition, natural language processing, image recognition
• analysis of the video track, sound track, closed captioning if
present, and any other available information
Each mode gives imperfect information. Therefore use
many approaches and combine the evidence, as sketched below.
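A minimal sketch of the combination step, assuming relevance evidence arrives as per-mode scores in [0, 1] and is fused by a weighted sum (the modes, weights, and scores are illustrative, not Informedia's actual fusion method):

# Illustrative weights for each mode of evidence.
MODE_WEIGHTS = {"speech": 0.5, "captions": 0.3, "ocr": 0.2}

def combined_score(scores):
    # scores maps mode name -> relevance score in [0, 1];
    # modes that produced no evidence are simply absent.
    return sum(MODE_WEIGHTS[mode] * s for mode, s in scores.items())

# Two video segments with evidence from different subsets of modes.
segments = {
    "seg_17": {"speech": 0.9, "captions": 0.6},            # no on-screen text
    "seg_42": {"speech": 0.4, "captions": 0.7, "ocr": 0.8},
}

# Rank segments by combined evidence, strongest first.
for seg in sorted(segments, key=lambda s: combined_score(segments[s]),
                  reverse=True):
    print(seg, round(combined_score(segments[seg]), 2))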
Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount of
information about the various resources varies greatly,
but clues from many different sources can be combined.
"The fundamental premise of the research was that the
integration of these technologies, all of which are
imperfect and incomplete, would overcome the
limitations of each, and improve the overall performance
in the information retrieval task."
[Wactlar, 2000]
Informedia Library Creation
[Diagram: video, audio, and text streams feed speech recognition,
image extraction, and natural language interpretation; segmentation
then produces segments with derived metadata.]
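The same flow can be sketched in code, with trivial stand-ins for each component (the function names, data, and one-segment-per-chunk rule below are hypothetical; the real components are substantial research systems):

from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float                   # seconds into the broadcast
    end: float
    metadata: dict = field(default_factory=dict)

def recognize_speech(audio):
    # Stand-in for speech recognition of the sound track.
    return [(0.0, "flood warnings issued"), (30.0, "sports roundup")]

def extract_keyframes(video):
    # Stand-in for image extraction / scene-break detection.
    return {0.0: "frame_0000.jpg", 30.0: "frame_0900.jpg"}

def segment_broadcast(transcript, keyframes, total=60.0):
    # Stand-in segmentation: one segment per transcript chunk, each
    # carrying its derived metadata (transcript text and a keyframe).
    ends = [t for t, _ in transcript][1:] + [total]
    return [Segment(t, end, {"transcript": words,
                             "keyframe": keyframes.get(t)})
            for (t, words), end in zip(transcript, ends)]

for seg in segment_broadcast(recognize_speech(None), extract_keyframes(None)):
    print(seg)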
Informedia: Information Discovery
[Diagram: the user queries via natural language and browses via
multimedia surrogates; requested segments and metadata are returned
from the store of segments with derived metadata.]
Text Extraction
Sources
Sound track: Automatic speech recognition using the Sphinx II and III
recognition systems. (Unrestricted vocabulary, speaker independent,
multi-lingual, background sounds.) Error rates of 25% and up.
Closed captions: Digitally encoded text. (Not on all video. Often
inaccurate.)
Text on screen: Can be extracted by image recognition and optical
character recognition. (Matches speaker with name.)
Query
Spoken query: Automatic speech recognition using the same system
as is used to index the sound track.
Typed query: Typed by the user.
Image Understanding
Informedia has developed specialized tools for
various aspects of image understanding:
• scene break detection
• segmentation
• icon selection
• image similarity matching
• camera motion and object tracking
• video-OCR (recognize text on screen)
• face detection and association
Multimodal Metadata Extraction
Speech Recognition: An Evaluation Experiment
Test corpus:
• 602 news stories from CNN, etc. Average length 672 words.
• Manually transcribed to obtain accurate text.
• Speech recognition of the sound track using Sphinx II (50.7% word error rate).
• Errors introduced artificially to give error rates from 0% to 80%.
• Relative precision and recall (using a vector ranking) were used
as measures of retrieval performance.
As word error rate increased from 0% to 50%:
• Relative precision fell from 80% to 65%
• Relative recall fell from 90% to 80%
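The setup can be imitated in a few lines, with random word substitution standing in for recognition errors and a crude term-overlap score standing in for the vector ranking used in the actual experiment (the story text, query, and seed are invented):

import random

def corrupt(words, wer, rng):
    # Replace a fraction `wer` of words with out-of-vocabulary
    # tokens, simulating speech recognition errors.
    out = list(words)
    for i in rng.sample(range(len(out)), int(wer * len(out))):
        out[i] = "<err>"
    return out

def match_score(query, words):
    # Crude relevance score: fraction of query terms present.
    present = set(words)
    return sum(term in present for term in query) / len(query)

rng = random.Random(430)
story = "governor signs water bill after severe flooding in the valley".split()
query = ["flooding", "water", "valley"]

for wer in (0.0, 0.25, 0.5, 0.8):
    degraded = corrupt(story, wer, rng)
    print(f"WER {wer:.0%}: score {match_score(query, degraded):.2f}")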
Speech recognition and retrieval performance
User Interface Concepts
Users need a variety of ways to search and browse, depending
on the task being carried out and their preferred style of working:
• Visual icons: one-line headlines, film strip views, video skims,
transcript following of the audio track
• Collages
• Semantic zooming
• Results set
• Named faces
• Skimming
Thumbnails, Filmstrips and Video Skims
Thumbnail:
• A single image that illustrates the content of a video
Filmstrip:
• A sequence of thumbnails that illustrate the flow of a video
segment
Video skim:
• A short video that summarizes the contents of a longer sequence
by combining brief clips of video and sound that provide an
overview of the full sequence
Creating a Filmstrip
Separate video sequence into shots
• Use techniques from image recognition to identify dramatic
changes of scene. Frames with similar color characteristics are
assumed to be part of a single shot (sketched in code below).
Choose a sample frame
• Default is to select the middle frame from the shot.
• If there is camera motion, select the frame where the motion ends.
User feedback:
• Frames are tied to time sequence.
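A minimal sketch of this procedure, assuming each frame has already been reduced to a normalized color histogram (the threshold and the toy histograms are illustrative):

import numpy as np

def shot_boundaries(histograms, threshold=0.4):
    # A new shot starts wherever consecutive color histograms
    # differ by more than the threshold (L1 distance).
    starts = [0]
    for i in range(1, len(histograms)):
        if np.abs(histograms[i] - histograms[i - 1]).sum() > threshold:
            starts.append(i)
    return starts

def filmstrip_frames(histograms, threshold=0.4):
    # Default rule from the slide: pick the middle frame of each shot.
    starts = shot_boundaries(histograms, threshold)
    ends = starts[1:] + [len(histograms)]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends)]

# Toy data: six frames with a scene change after the third frame.
hists = [np.array(h) for h in
         [[.50, .30, .20], [.50, .31, .19], [.48, .30, .22],
          [.10, .20, .70], [.12, .20, .68], [.11, .19, .70]]]
print(filmstrip_frames(hists))   # -> [1, 4]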
Creating Video Skims
Static:
• Precomputed, based on video and audio phrases
• Fixed compression, e.g., a one-minute skim of a 10-minute sequence
(see the sketch below)
Dynamic:
• After a query, the skim is created to emphasize the context of the hit
• Variable compression, selected by the user
• Adjustable during playback
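A hedged sketch of assembling a static skim, assuming each candidate phrase carries an importance score (the scores and timings are made up; the slide only says skims are precomputed from video and audio phrases):

def build_skim(phrases, compression=10):
    # phrases: list of (start, end, importance) tuples, in seconds.
    # Greedily keep the most important phrases until the skim reaches
    # roughly 1/compression of the full duration, then restore order.
    total = sum(end - start for start, end, _ in phrases)
    budget = total / compression
    chosen, used = [], 0.0
    for phrase in sorted(phrases, key=lambda p: p[2], reverse=True):
        length = phrase[1] - phrase[0]
        if used + length <= budget:
            chosen.append(phrase)
            used += length
    return sorted(chosen)   # playback order

# Ten-minute sequence -> roughly a one-minute skim.
phrases = [(0, 40, 0.9), (40, 200, 0.2), (200, 220, 0.8),
           (220, 420, 0.3), (420, 600, 0.4)]
print(build_skim(phrases))   # -> [(0, 40, 0.9), (200, 220, 0.8)]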
Limits to Scalability
Informedia has demonstrated effective information discovery
with moderately large collections
Problems with increased scale:
• Technical -- storage, bandwidth, etc.
• Diversity of content -- difficult to tune heuristics
• User interfaces -- complexity of browsing grows with scale
Lessons Learned
• Searching and browsing must be considered integrated parts
of a single information discovery process.
• Data (content and metadata), computing systems (e.g.,
search engines), and user interfaces must be designed
together.
• Multi-modal methods compensate for incomplete or error-prone data.