Lesk-technology.ppt

Digital Library Technologies
●Text: formats and storage
●Searching text
●Images
●Speech
●Multimedia
●Networking
Text formats
ASCII: simple, no formatting, accessible
HTML: simple, moderate formatting, accessible
word processors: formatting, access limited
PDF: formatted, complex, access limited
TEI: formatted, open, very complex
<sp who="Oph"><speaker><hi rend="i">Oph.</hi></speaker>
<p><lb n="46"/> Pray let's have no words of this, but when
<lb n="47"/> they ask you what it means, say you this:</p>
<stage> <hi rend="i">Song.</hi></stage>
<lg part="M" type="song">
<l n="48">"To-morrow is <rs key="StValentine">Saint Valentine's</rs>
day,</l>
<l n="49">All in the morning betime,</l>
<l n="50"> And I a maid at your window, </l>
Control of the format
ASCII: user has complete control of display
HTML: user has considerable control of display
PDF: publisher has all the control
Authors and readers disagree on who should decide things like
column layout, type size, etc. Over time, more and more Web
documents have the format nailed down.
Text compression
Basic strategies: statistics or dictionaries
Statistics: Morse code: the more frequent letters get shorter codes
Huffman coding is the traditional method here, but lengthening the alphabet
(coding letter pairs or whole words instead of single letters) will give
better results.
Dictionaries: Lempel-Ziv or LZW. Find repeated strings and replace them with
references into a dictionary (LZ methods build it adaptively rather than
storing it up front).
Questions: instantaneously decodable? Is a factor of 2 worth the trouble?
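
The statistical strategy can be sketched in a few lines. This is a minimal illustration of textbook Huffman coding over single characters, not a production compressor; the function names (huffman_code, encode, decode) are chosen here for illustration. It also speaks to the "instantaneously decodable" question: because no code word is a prefix of another, the decoder can emit each symbol as soon as its last bit arrives.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each symbol in text to its bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Decode left to right; valid because no code is a prefix of another."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    sample = "to-morrow is saint valentine's day"
    table = huffman_code(sample)
    packed = encode(sample, table)
    print(len(packed), "bits versus", 8 * len(sample), "bits as 8-bit characters")
    print(decode(packed, table) == sample)
```

As with Morse code, the more frequent characters end up with the shorter bit strings. Coding over a longer alphabet would mean feeding pairs or words, rather than single characters, into the same procedure.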
Searching text files
Linear scan (grep): not for very big collections, no update problem
Inverted files: tries, or just divide by blocks
May wish to compress occurrence lists, index by both ends, allow
fielded searching, and keep frequency information
Signature files: electronic edge-notched cards, trading space for
false drops
Bitmaps: best for very common words; add to inverted files
Clustering: for complex searching, summarizing results
Case folding, suffixing, stop lists.
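
A minimal sketch of an inverted file with case folding and a stop list, under stated assumptions: the stop list, the tokenizer, and all function names are illustrative, suffixing (stemming) is left out, and the occurrence lists are kept uncompressed in memory. Storing positions per document also gives the frequency information mentioned above.

```python
import re
from collections import defaultdict

# Illustrative stop list; a real system would use a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Case-fold, split on anything but letters and apostrophes, drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to {doc_id: [positions]}; list lengths are term frequencies."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search(index, query):
    """Return sorted ids of the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    docs = [
        "To-morrow is Saint Valentine's day",
        "All in the morning betime",
        "And I a maid at your window",
    ]
    index = build_index(docs)
    print(search(index, "Saint DAY"))
    print(search(index, "maid window"))
```

Unlike a linear scan, the index must be updated whenever a document is added or changed, which is the update problem that grep avoids.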
Grab – an example compromise
Grab was an attempt to balance between the speed of inversion and the
compactness of linear search.
Bitmap vectors on hashed words, compressed from 10 bits to 4 bits. Go back
later and cast out false drops.
For 5% extra space, get a 90% speedup over linear search.
Never caught on. Space is too cheap today, and files are too big. Might
as well use full inversion.
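
The general idea of a bitmap on hashed words followed by a verification pass can be sketched as below. This is not Grab's actual implementation: the 10-bit to 4-bit compression is not reproduced, and NUM_BITS, word_bit, build_signatures and search are hypothetical names and parameters chosen for illustration.

```python
import zlib

NUM_BITS = 64  # signature width per block (assumed; a real system would use far more)


def word_bit(word):
    """Hash a word to one bit position; crc32 is used because it is deterministic."""
    return zlib.crc32(word.lower().encode("utf-8")) % NUM_BITS


def build_signatures(blocks):
    """One integer bitmap per block, with a bit set for every hashed word."""
    sigs = []
    for block in blocks:
        sig = 0
        for word in block.split():
            sig |= 1 << word_bit(word)
        sigs.append(sig)
    return sigs


def search(blocks, sigs, word):
    """Return (candidates, hits): blocks passing the bitmap test, then verified ones."""
    mask = 1 << word_bit(word)
    candidates = [i for i, sig in enumerate(sigs) if sig & mask]
    # Second pass: linear scan of the candidate blocks only, casting out false drops.
    hits = [i for i in candidates if word.lower() in blocks[i].lower().split()]
    return candidates, hits


if __name__ == "__main__":
    blocks = [
        "pray let us have no words of this",
        "all in the morning betime",
        "and i a maid at your window",
    ]
    sigs = build_signatures(blocks)
    print(search(blocks, sigs, "morning"))
```

The bitmap test can report a block that does not contain the word (a false drop) but never misses one that does, so the linear scan only has to visit the candidates.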
Why not a DBMS?
Why don't text retrieval systems use a DBMS underneath?
Few numerical entries, and vast numbers of items
Special needs, such as index browsing and truncation searching
Input is not neatly structured into records, and items of variable length
may have to be retrieved
Not much updating.
Parallel searching: just coming into vogue.
What do the search engines do?
Very large inverted files and parallel search engines on a great many machines
(thousands).
Big caches. They may search only in the cache and avoid all disk delays.
They are willing to give different results depending on what data is in the cache.
Collaborative ranking and filtering
Google is the best known search engine; it derives from “BackRub” at
the Stanford digital library project.
See: http://www-db.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page.
Simply: pages pointed to by a lot of other people are probably better.
Other work from Jon Kleinberg at Cornell has looked at links in
both directions, and this is all related to “collaborative filtering”.
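
The "pointed to by a lot of other people" idea can be sketched as a simple iteration over the link graph. This is a simplified illustration in the spirit of the Brin and Page paper cited above, not Google's production ranking; the function name, the damping value of 0.85, the fixed iteration count, and the handling of pages without out-links are assumptions made for this sketch.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to; returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score, then shares from pages linking to it.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumed policy: a page with no out-links spreads its score evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    scores = rank_pages(web)
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```

In this toy graph, page c is pointed to by three others and comes out on top, and page a ranks above b and d because its single in-link comes from the highly ranked c.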
Image formats
There are a great many image formats. The best known are GIF and
JPG. Why are there so many?
Images are bulky. The best compression is “lossy”, and one can
choose what kinds of things to lose.
GIF loses color space (at most 256 colors per image): it is perfect on b&w.
JPG is more general
You can do non-lossy compression, e.g. TIFF G4: just run ordinary
Huffman-like compression on the signal.
Wavelet and fractal compression are coming along. JPEG2000 may
replace JPEG someday.
DjVu is particularly interesting: oriented for text, it divides the page
into background and foreground and does wavelets on the
background and dictionary compression on the foreground.
Sound formats
Some technology, but mostly commerce
a) Digitization rates. You can do speech at 8 kHz, but for music you
ought to do better: CD music is 44.1 kHz.
b) Compression: You can get speech to 2400 bits/sec or so, and music by a
factor of 10 (MP3 is the current favorite).
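
A rough worked calculation of what these rates mean. The 44.1 kHz and 8 kHz sampling rates, the 2400 bits/sec speech figure, and the factor of 10 come from the points above; the sample sizes and channel counts (16-bit stereo for CD audio, 8-bit mono for speech) are assumptions added here for the arithmetic.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed bit rate in bits/sec."""
    return sample_rate_hz * bits_per_sample * channels


cd_rate = pcm_bit_rate(44_100, 16, 2)       # 1,411,200 bits/sec
compressed_music = cd_rate / 10             # about 141,000 bits/sec
speech_rate = pcm_bit_rate(8_000, 8, 1)     # 64,000 bits/sec
speech_ratio = speech_rate / 2_400          # roughly a factor of 27

if __name__ == "__main__":
    print("CD audio:", cd_rate, "bits/sec")
    print("Music at 10:1:", compressed_music, "bits/sec")
    print("8 kHz speech:", speech_rate, "bits/sec")
    print("Speech compression ratio:", round(speech_ratio, 1))
```

Under these assumptions, speech compresses far more than music does, because a coder built around a model of the voice can discard much more of the signal.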
Commerce:
Real vs. Microsoft (WMP).
Digital rights management
Unlike text, few people can write sound manipulation software, and so
everyone is dependent on one or another vendor.
Video formats
Video is extremely bulky. With 24 frames/second (movies) or 30 (TV),
an hour of video is easily a gigabyte even with minimal resolution on
each image. But there is enormous scene-to-scene redundancy.
MPEG sequence: key frames and then differentially coded frames; JPEG-like
coding on individual frames; prediction of moving objects.
MPEG-1: 1.5 Mbit/sec; MPEG-2: 4-9 Mbit/sec
MPEG-4: mixing synthetic (animation) with camera video
MPEG-7: metadata
The next real improvement is going to have to be longer-term storage and
segmentation, e.g. separating the background from a scene and keeping it
for many frames.
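
A rough storage calculation for the rates quoted above. The MPEG bit rates and the 30 frames/second figure are from this slide; the uncompressed frame size of 352 x 240 pixels at 3 bytes per pixel is an assumption picked to stand for "minimal resolution", and the function names are illustrative.

```python
def megabytes_per_hour(mbit_per_sec):
    """Convert a bit rate in Mbit/sec to storage in megabytes (10**6 bytes) per hour."""
    return mbit_per_sec * 3600 / 8


def raw_bytes_per_hour(width, height, bytes_per_pixel, frames_per_sec):
    """Storage for one hour of uncompressed video, in bytes."""
    return width * height * bytes_per_pixel * frames_per_sec * 3600


mpeg1 = megabytes_per_hour(1.5)                 # 675.0 MB per hour
mpeg2_low = megabytes_per_hour(4)               # 1800.0 MB per hour
mpeg2_high = megabytes_per_hour(9)              # 4050.0 MB per hour
raw = raw_bytes_per_hour(352, 240, 3, 30)       # 27,371,520,000 bytes per hour

if __name__ == "__main__":
    print("MPEG-1:", mpeg1, "MB/hour")
    print("MPEG-2:", mpeg2_low, "to", mpeg2_high, "MB/hour")
    print("Uncompressed (assumed 352x240, 24-bit):", raw / 1e9, "GB/hour")
    print("Ratio of uncompressed to MPEG-1:", round(raw / (mpeg1 * 1e6), 1))
```

Even at this small assumed frame size, the uncompressed hour is tens of gigabytes, which is why the frame-to-frame redundancy matters so much.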
Image searching
QBIC: color, texture, some shape
Color histogram is easiest: beware any demo of sunsets
Current work at Berkeley better at segmentation & labeling
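
A minimal sketch of color-histogram matching, assuming images are given as plain lists of (r, g, b) tuples. This is not QBIC's actual method; the bin count, the histogram-intersection measure, and the function names are illustrative choices. It also shows why sunset demos deserve suspicion: the histogram ignores where the colors are, so any image with the same mix of colors matches perfectly.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram of (r, g, b) pixels with 0..255 channel values.

    bins_per_channel is assumed to divide 256 evenly.
    """
    step = 256 // bins_per_channel
    counts = {}
    total = 0
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        counts[key] = counts.get(key, 0) + 1
        total += 1
    return {k: v / total for k, v in counts.items()} if total else {}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())


if __name__ == "__main__":
    orange, dark_blue, green = (250, 120, 30), (20, 20, 60), (30, 160, 40)
    sunset = [orange] * 6 + [dark_blue] * 4      # orange sky over dark sea
    scrambled = [dark_blue] * 4 + [orange] * 6   # same colors, different layout
    forest = [green] * 10
    h = color_histogram(sunset)
    print(histogram_intersection(h, color_histogram(scrambled)))
    print(histogram_intersection(h, color_histogram(forest)))
```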
Image labeling
David Forsyth & Jitendra Malik, Berkeley
Sound searching
Speech: speech recognition; speaker identification
Music: we now have “hum & search” software. See Bill
Birmingham, U of Michigan; Donald Byrd, U. Mass.
Video searching
See Informedia, Howard Wactlar, CMU.
Combination of:
closed-captioning
speech recognition
face recognition
OCR of on-screen text
some image searching
Also a great deal of work on presentation.
Summary
There are lots of things in digital libraries today. And there are
more to come: 3-D objects, scientific data, software, …
All of this will have to be stored, organized and searched.