Digital Library Technologies

● Text: formats and storage
● Searching text
● Images
● Speech
● Multimedia
● Networking

Text formats

● ASCII: simple, no formatting, accessible
● HTML: simple, moderate formatting, accessible
● Word processors: formatting, access limited
● PDF: formatted, complex, access limited
● TEI: formatted, open, very complex

<sp who="Oph"><speaker><hi rend="i">Oph.</hi></speaker>
<p><lb n="46"/> Pray let's have no words of this, but when
<lb n="47"/> they ask you what it means, say you this:</p>
<stage> <hi rend="i">Song.</hi></stage>
<lg part="M" type="song">
<l n="48">"To-morrow is <rs key="StValentine">Saint Valentine's</rs> day,</l>
<l n="49">All in the morning betime,</l>
<l n="50"> And I a maid at your window, </l>

Control of the format

● ASCII: user has complete control of display
● HTML: user has considerable control of display
● PDF: publisher has all the control
● Authors and readers disagree on who should decide things like column layout, type size, etc.
● Over time, more and more Web documents have the format nailed down.

Text compression

Basic strategies: statistics or dictionaries
● Statistics: as in Morse code, the more frequent letters get shorter codes. Huffman coding is the traditional method here, but lengthening the alphabet (coding letter sequences or whole words as single symbols) will give better results.
● Dictionaries: Lempel-Ziv or LZW. Find repeated strings and replace them with references to a dictionary, which LZ methods build as they go.
● Questions: Is the code instantaneously decodable? Is a factor of 2 worth the trouble?

Searching text files

● Linear scan (grep): not for very big collections, but no update problem
● Inverted files: tries, or just divide by blocks. May wish to compress occurrence lists, index by both ends, allow fielded searching, and keep frequency information
● Signature files: electronic edge-notched cards, trading space for false drops
● Bitmaps: best for very common words; add to inverted files
● Clustering: for complex searching, summarizing results
● Case folding, suffixing, stop lists
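As a rough illustration of the inverted-file idea with case folding and a stop list, here is a minimal Python sketch. The stop list, the function names (tokenize, build_index, search) and the three sample documents are made up for this example; suffixing, occurrence-list compression and frequency information are left out.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration; real systems use longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Case-fold, split on non-letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(docs):
    """Map each word to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def search(index, query):
    """Return ids of documents containing every query word (AND search)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Sample documents (assumed for the example).
docs = {
    1: "To-morrow is Saint Valentine's day",
    2: "All in the morning betime",
    3: "And I a maid at your window",
}
index = build_index(docs)
print(search(index, "the MORNING"))
```

A query is answered by looking up occurrence lists and intersecting them, so no document text is scanned at search time; the price is that the index must be updated whenever documents are added, which a linear scan avoids.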
Grab: an example compromise

● Grab was an attempt to balance the speed of inversion against the compactness of linear search.
● Bitmap vectors on hashed words, compressed from 10 bits to 4 bits.
● Go back later and cast out false drops.
● For 5% extra space, get a 90% speedup over linear search.
● Never caught on. Space is too cheap today, and files are too big. Might as well use full inversion.

Why not a DBMS?

Why don't text retrieval systems use a DBMS underneath?
● Few numerical entries, and vast numbers of items
● Special needs, such as index browsing and truncation searching
● Input is not neatly structured into records, and items of variable length may have to be retrieved
● Not much updating
● Parallel searching: just coming into vogue

What do the search engines do?

● Very large inverted files and parallel search engines on a great many machines (thousands)
● Big caches. They may search only in the cache and avoid all disk delays
● Are willing to give different results depending on what data is in cache

Collaborative ranking and filtering

● Google is the best known search engine; it derives from “backrub” at the Stanford digital library project. See: Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://www-db.stanford.edu/~backrub/google.html
● Simply: pages pointed to by a lot of other people are probably better.
● Other work from Jon Kleinberg at Cornell has looked at links in both directions, and this is all related to “collaborative filtering”.

Image formats

● There are a great many image formats. The best known are GIF and JPG.
● Why are there so many? Images are bulky. The best compression is “lossy”, and one can choose what kinds of things to lose.
● GIF loses color space (at most 256 colors): it is perfect on b&w. JPG is more general.
● You can do non-lossy compression, e.g. TIFF G4: just run ordinary Huffman-like compression on the signal.
● Wavelet and fractal compression are coming along. JPEG2000 may replace JPEG someday.
● DjVu is particularly interesting: oriented toward text, it divides the page into background and foreground, and uses wavelets on the background and dictionary compression on the foreground.

Sound formats

Some technology, but mostly commerce
a) Digitization rates. You can do speech at 8 kHz, but for music you ought to do better: CD music is 44.1 kHz.
b) Compression: You can get speech down to 2400 bits/sec or so, and compress music by a factor of 10 (MP3 is the current favorite).
● Commerce: Real vs. Microsoft (WMP). Digital rights management.
● Unlike text, few people can write sound manipulation software, and so everyone is dependent on one or another vendor.

Video formats

● Video is extremely bulky. With 24 frames/second (movies) or 30 (TV), an hour of video is easily a gigabyte even with minimal resolution on each image.
● But there is enormous scene-to-scene redundancy.
● MPEG sequence: key frames and then differentially coded frames; JPEG-like coding on individual frames; prediction of moving objects.
● MPEG-1: 1.5 Mbit/sec
● MPEG-2: 4-9 Mbit/sec
● MPEG-4: mixing synthetic (animation) with camera video
● MPEG-7: metadata
● The next real improvement is going to have to be longer-term storage and segmentation, e.g. separating the background from a scene and keeping it for many frames.

Image searching

● QBIC: color, texture, some shape
● Color histogram is easiest: beware any demo of sunsets
● Current work at Berkeley is better at segmentation & labeling

Image labeling: David Forsyth & Jitendra Malik, Berkeley

Sound searching

● Speech: speech recognition; speaker identification
● Music: we now have “hum & search” software. See Bill Birmingham, U. of Michigan; Donald Byrd, U. Mass.

Video searching

See Informedia, Howard Wactlar, CMU. Combination of:
● closed-captioning
● speech recognition
● face recognition
● OCR of on-screen text
● some image searching
Also a great deal of work on presentation.

Summary

There are lots of things in digital libraries today.
And there are more to come: 3-D objects, scientific data, software, … All of this will have to be stored, organized and searched.