Digital Libraries Review and Reflection 1 Agenda • Review the course syllabus and see how we did • Review the key points of the semester • Reconsider some of the spot checks and thinking points • Update project status, prepare for next week’s presentations 2 What is a library • Our concept map exercise. – Do we have a better understanding now about what a modern library is? – What are the essential characteristics? • The 5S model for describing the components of a DL • Streams, spaces, structures, societies, scenarios 3 The Etana digital library of archeological artifacts Scenario model Society model Archaeologist General public Services Value added Service Manager Domain specific Space model Geographic space Structure model Region Stream model User interface Text *Partition Video Information Satisfaction Metric space Metadata *Site Repository building *Sub-partition Audio Taxonomies Spatial Temporal Artifact-specific *Locus Drawing *Container Photo *Artifact 3D 4 In the beginning • Vannevar Bush’s vision – How far have we come? – What did you notice about this article -- style or content or background or anything else. – Did the article suggest anything you would not want to see happen? 5 Applying the model, informally Looking back on your project plan • Stream - what types of data? gif, jpg, avi, docx, pdf, html? • Structure - How are the elements organized? Is there a hierarchy? Are there multiple structures? • Spaces - How will we index the items? How will we divide them into related groups • Scenarios - what services will we provide? What information do we need to provide those services? What events might happen that we need to plan for? • Societies - who is the library intended to serve? Remember to include agents and other processes as well as users. 
Describing the content • How to describe content – Metadata • Machine readable description of anything • What description – Machine readable requires standard descriptive elements • Dublin Core (http://dublincore.org/) – International standard – “a standard for cross-domain information resource description.” – 15 descriptive elements • Other metadata schemes – IEEE-LOM XML • • • • • XML is a markup language XML describes features There is no standard XML Use XML to create a resource type Separately develop software to interact with the data described by the XML codes. Source: tutorial at w3school.com Elements and attributes • Use elements to describe data • Use attributes to present information that is not part of the data –For example, the file type or some other information that would be useful in processing the data, but is not part of the data. Parts of an XML document • Elements – The components of an XML document – Some contain other parts, some are empty • Ex in HTML: “br” or “table” in XML “ingredient” • Attributes – Information about elements, not data • Ex in HTML “src=” in XML “scale=” The HTML examples are familiar; the XML examples are made up – dependent on the specific XML scheme used • Entities – Special characters or strings with pre-assigned meaning • Ex in HTML &nbsp for non-breaking space • PCDATA – Parsed Character data: text that will be parsed and interpreted by the reader. Tags and entities will be expanded and used in presentation. • CDATA – Character data: text that will not be parsed and interpreted. It will be displayed exactly as provided. Using XML - an example Define the fields of a recipe collection: <?xml version="1.0" encoding="ISO-8859-1"?> <recipe> <recipe-title> </recipe-title> <ingredient-list> <ingredient> <ingredient-amount> </ingredient-amount> <ingredient-name> </ingredient-name> </ingredient> </ingredient-list> <directions> ISO 8859 is a character set. 
</directions>
</recipe>

See http://www.bbsinc.com/iso8859.html

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe>
  <recipe-title>Meringue cookies</recipe-title>
  <ingredient-list>
    <ingredient>
      <ingredient-amount>3</ingredient-amount>
      <ingredient-name>egg whites</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>1 cup</ingredient-amount>
      <ingredient-name>sugar</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>1 teaspoon</ingredient-amount>
      <ingredient-name>vanilla</ingredient-name>
    </ingredient>
    <ingredient>
      <ingredient-amount>2 cups</ingredient-amount>
      <ingredient-name>mini chocolate chips</ingredient-name>
    </ingredient>
  </ingredient-list>
  <directions>Beat the egg whites until stiff. Stir in sugar, then vanilla. Gently fold in chocolate chips. Place in warm oven at 200 degrees for an hour. Alternatively, place in an oven at 350 degrees. Turn oven off and leave overnight.</directions>
</recipe>

Notes on the example:
• The DOCTYPE line is an external reference to the DTD.
• Not the way that I want to see a recipe in a magazine!
• What could we do with a large collection of such entries?
• How would we get the information entered into a collection?

Vocabulary
• Given the need for processing, do you want free text or restricted entries?
• Free text gives more flexibility for the person making the entry
• Controlled vocabulary helps with
  – Consistent processing
  – Comparison between entries
• Controlled vocabulary limits
  – Options for what is said

Dublin Core elements
see: http://dublincore.org/documents/dces/
• Title
• Creator: Entity primarily responsible for making content of the resource
• Subject - C
• Description
• Publisher: Entity making the resource available
• Contributor: Contributor to content of the resource
• Date: YYYY-MM-DD
• Type - C: Ex: collection, dataset, event, image
• Format - C: What is needed to display or operate the resource.
• Identifier: Unambiguous ID
• Source
• Language: Standards RFC 3066, ISO639
• Relation: Ref.
to related resource Coverage - C Space, time, jurisdiction. Rights Rights Management information event, image C = controlled vocabulary recommended. Dublin Core Terms • An update to the original DC elements – Adds the concept of range and domain Each term has this minimal set of attributes: • Name: A token appended to the URI of a DCMI namespace to create the URI of the term. • Label: The human-readable label assigned to the term. • URI: The Uniform Resource Identifier used to uniquely identify a term. • Definition: A statement that represents the concept and essential nature of the term. • Type of Term: The type of term as described in the DCMI Abstract Model [DCAM]. DC Terms Additional Attributes possible: • • • • • • • • • • • • Comment: Additional information about the term or its application. See: Authoritative documentation related to the term. References: A resource referenced in the Definition or Comment. Refines: A Property of which the described term is a Sub-Property. Broader Than: A Class of which the described term is a Super-Class. Narrower Than: A Class of which the described term is a Sub-Class. Has Domain: A Class of which a resource described by the term is an Instance. Has Range: A Class of which a value described by the term is an Instance. Member Of: An enumerated set of resources (Vocabulary Encoding Scheme) of which the term is a Member. Instance Of: A Class of which the described term is an instance. Version: A specific historical description of a term. Equivalent Property: A Property to which the described term is equivalent. 
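The minimal attribute set described above (name, label, URI, definition, type of term) maps naturally onto a small record type. The following Python sketch is illustrative only: the `Term` class and its field names are hypothetical, the namespace string is assumed to be the DCMI `/terms/` namespace, and the label and definition shown for `abstract` are placeholders that should be checked against the DCMI documentation.

```python
from dataclasses import dataclass

# Assumed DCMI namespace for the DC Terms vocabulary.
DCTERMS_NS = "http://purl.org/dc/terms/"


@dataclass(frozen=True)
class Term:
    """Hypothetical record holding the minimal attributes of one DC term."""
    name: str           # token appended to the namespace URI
    label: str          # human-readable label
    definition: str     # statement of the concept
    type_of_term: str   # e.g. "Property" or "Class"
    namespace: str = DCTERMS_NS

    @property
    def uri(self) -> str:
        # The term URI is the namespace URI with the name appended.
        return self.namespace + self.name


# Placeholder values for illustration.
abstract = Term(
    name="abstract",
    label="Abstract",
    definition="A summary of the resource.",
    type_of_term="Property",
)
print(abstract.uri)
```

Deriving the URI from the namespace and the name, rather than storing it separately, keeps the two from drifting apart.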
The DC Terms – from 15 to … abstract, accessRights, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, available, bibliographicCitation, conformsTo, contributor, coverage, created, creator, date, dateAccepted, dateCopyrighted, dateSubmitted, description, educationLevel, extent, format, hasFormat, hasPart, hasVersion, identifier, instructionalMethod, isFormatOf, isPartOf, isReferencedBy, isReplacedBy, isRequiredBy, issued, isVersionOf, language, license, mediator, medium, modified, provenance, publisher, references, relation, replaces, requires, rights, rightsHolder, source, spatial, subject, tableOfContents, temporal, title, type, valid Source: www.cs.cornell.edu/courses/cs502/2002sp/.../lecture%202-26-02.ppt Using Dublin Core • Dublin core fields are attached to a resource – separate metadata file associated with the main item – as META tags within the html Example:< META name = “DC.Title” content = “Novi Belgii Novæque Angliæ:nec non partis Virginiæ tabula multis in locis emendata ” lang = “la” > 18 Source: http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter7.html Framework for Access Management Legal and Technical Issues • Legal: When is a resource available to digitize and make available? What requirements exist for controlling access? • Technical: How do we control access to a resource that is stored online? – Policies – Encoding – Distribution limitations source: http://www.unc.edu/~unclng/public-d.htm Public Domain • Definition: A public domain work is a creative work that is not protected by copyright and which may be freely used by everyone. The reasons that the work is not protected include: – (1) the term of copyright for the work has expired; – (2) the author failed to satisfy statutory formalities to perfect the copyright or – (3) the work is a work of the U.S. Government. Even in the case of public domain, the origin of the work should be noted if possible. 
Date of work | Protected from | Term
Created 1-1-78 or after | When work is fixed in tangible medium of expression | Life + 70 years (or, if work of corporate authorship, the shorter of 95 years from publication or 120 years from creation)
Published before 1923 | In public domain | None
Published 1923-63 | When published with notice | 28 years + could be renewed for 47 years, now extended by 20 years for a total renewal of 67 years. If not so renewed, now in public domain
Published 1964-77 | When published with notice | 28 years for first term; now automatic extension of 67 years for second term
Created before 1-1-78 but not published | 1-1-78, the effective date of the 1976 Act which eliminated common law copyright | Life + 70 years or 12-31-2002, whichever is greater
Created before 1-1-78 but published between then and 12-31-2002 | 1-1-78, the effective date of the 1976 Act which eliminated common law copyright | Life + 70 years or 12-31-2047, whichever is greater

Chart created by Lolly Gasaway. Updates at http://www.unc.edu/~unclng/public-d.htm

Fair use
• No clear, easy answers.
• Checklist provided in the article is a good guide to the issues.
• Link to the checklist: http://www.nyu.edu/its/humanities/ninchguide/IV/
  – Search for checklist

Moral rights
• Fair to the creator
  – Keep the identity of the creator of the work
  – Do not cut the work
  – Generally, be considerate of the person (or institution) that created the work.

Getting Permission
• With the best will in the world, getting the appropriate permissions is not always easy.
  – Identify who holds the rights
  – Get in touch with the rights holder
  – Get a suitable agreement to cover the needs of your use.
• Useful links:
  http://www.loc.gov/copyright/
  http://www.copylaw.com/new_articles/permission.html
  http://fairuse.stanford.edu/Copyright_and_Fair_Use_Overview/chapter1/1b.html
  http://www.k-state.edu/academicpersonnel/intprop/permission.htm - Includes sample letters requesting permission.
Checking copyright status Source: NINCH Guide to Good Practice. Chapter 4: Rights Management Considering people depicted in the work Source: NINCH Guide to Good Practice. Chapter 4: Rights Management Copyright: Lauryn G. Grant Recall: Spot check Part 1: 5-7 minutes • Working in groups of two or three, construct a scenario of when a work might be used. – – – – Put it in your digital library? Quote it in a paper? Use it for an assignment? Use it to bolster an argument? • Be specific about the exact nature of the work. Image? text? when created, who created it, etc. Spot check Part 2 – Rights management 10 – 15 minutes • Pass your scenario to another group, and receive one in turn. • Make a decision about the rights management issues related to the work you received. • Write a brief summary of the issues involved and what needs to be done Technical issues • Rights management is about policy and technical issues. We looked at policy until now. Now for the technical issues. • Link the resource to the copyright statements • Maintain that link when the resource is copied or used • Approaches: – – – – Steganography Encryption Digital Wrappers Digital Watermarks Issues in Encryption • General cases for protection of controlled content: Concern for passive listening, active interference. – Listening: intruder gains information, may not be detected. Effects indirect. – Active interference • Intruder may prevent delivery of the message to the intended recipient. • Intruder may substitute a fake message for the intended one • Effects are direct and immediate • Less likely in the case of digital library content Message interception Encoding Method Ciphertext Eavesdropping Decoding Method Masquerading Original message Received message (Plain text) (Plain text) Intruder Types of Encryption Methods • Substitution – Simple adjustment, Caesar’s cipher • Each letter is replaced by one that is a fixed distance from it in the alphabet. A becomes D, B becomes E, etc. 
At the end, wrap around, so X becomes A, Y becomes B, Z becomes C.
    • May have been confusing the first time it was done, but it would not have taken long to figure it out.
    • Note the simple example at geocaching.com: No intention to hide or confuse. Just keep a person from seeing too much information about the hide, unless the person wants to see the help.
  – Simple substitution of other characters for letters -- numbers, dancing men, etc.
  – More complex substitution. No pattern to the replacement scheme.
    • See common cryptogram puzzles. These are usually made easier by showing the spaces between the words. (For a very modern version, see http://www.cryptograms.org/)

Dancing Men????
• Arthur Conan Doyle: The Adventure of the Dancing Men. A Sherlock Holmes Adventure.
“Speaking roughly, T, A, O, I, N, S, H, R, D, and L are the numerical order in which letters occur; but T, A, O, and I are very nearly abreast of each other, and it would be an endless task to try each combination until a meaning was arrived at.”
Read the story online and see the images and analysis of the decoding at http://camdenhouse.ignisart.com/canon/danc.htm

Types of encryption - 2
Hiding the text.
• The wax tablet example – message written on the base of the tablet and wax put over top of it with another message on the wax
• Steganography: (ste-g&n-o´gr&-fē) (n.) The art and science of hiding information by embedding messages within other, seemingly harmless messages. Steganography works by replacing bits of useless or unused data in regular computer files (such as graphics, sound, text, HTML, or even floppy disks) with bits of different, invisible information. This hidden information can be plain text, cipher text, or even images. (Definition from www.webopedia.com)
• Special software is needed for steganography, and there are freeware versions available at any good download site.
• Can be used to insert identification into a file to track its source.
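The Caesar substitution described earlier (each letter replaced by the one a fixed distance later, wrapping around at the end of the alphabet) can be sketched in a few lines of Python. This is a minimal illustration, not a recommended cipher; the function name `caesar` and the sample message are made up for the example.

```python
import string

ALPHABET = string.ascii_uppercase  # "ABC...XYZ"


def caesar(text: str, shift: int = 3) -> str:
    """Shift each letter by `shift` places, wrapping around after Z.

    Characters that are not letters (spaces, punctuation) pass through unchanged.
    """
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            # Modulo 26 gives the wrap-around: X -> A, Y -> B, Z -> C for shift 3.
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            result.append(ch)
    return "".join(result)


cipher = caesar("ATTACK AT DAWN")   # encode with the classic shift of 3
plain = caesar(cipher, -3)          # decode by shifting back
print(cipher)
print(plain)
```

Leaving the spaces in place, as here, is exactly what makes such messages easy to attack: word lengths survive the encoding.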
Types of encryption - 3
• Key-based shuffling
  – Using a mnemonic to make the key easy to remember.
• A machine to do the shuffling
[Diagram: input letters A, B, C, D connected to output letters A, B, C, D]
What shuffling is used? How would “CAB” look?

Monoalphabetic codes
• Any kind of substitution in which just one letter (or other symbol) represents one letter from the original alphabet is called monoalphabetic encoding.
  – Such codes are easy to break. That is what you do when you solve cryptograms.
  – Frequency distributions of letters in normal text for a given language are well known.
    • “The twelve most frequently-used letters in the English language are ETAOIN SHRDL, in that order.” -- http://www.cryptograms.org/

Letter distributions in English (all values in percent)
Letter | Letter | Digram | Digram | Word
A 7.81 | N 7.28 | TH 3.18 | OU 0.72 | THE 6.42
B 1.28 | O 8.21 | IN 1.54 | IT 0.71 | OF 4.02
C 2.93 | P 2.15 | ER 1.30 | ES 0.69 | AND 3.15
D 4.11 | Q 0.14 | RE 1.30 | ST 0.68 | TO 2.36
E 13.05 | R 6.64 | AN 1.08 | OR 0.68 | A 2.09
F 2.88 | S 6.46 | HE 1.08 | NT 0.67 | IN 1.77
G 1.39 | T 9.02 | AR 1.02 | HI 0.68 | THAT 1.25
H 5.85 | U 2.77 | EN 1.02 | EA 0.64 | IS 1.03
I 6.77 | V 1.00 | TI 1.02 | VE 0.64 | I 0.94
J 0.23 | W 1.49 | TE 0.98 | CO 0.59 | IT 0.93
K 0.42 | X 0.30 | AT 0.88 | DE 0.55 | FOR 0.77
L 3.60 | Y 1.51 | ON 0.84 | RA 0.55 | AS 0.76
M 2.62 | Z 0.09 | HA 0.84 | RO 0.55 | WITH 0.76
SOURCE: Tanenbaum, Computer Networks, Prentice Hall, 1981

Recall: Spot Check
• Go to the cryptogram site (www.cryptograms.org) and solve a puzzle.
• Work in groups of two or three
• What information is helpful?
• What makes a puzzle hard?
• Suppose there were no spaces between the words? Then what would you do?

Disguising frequencies
• First trick: use more than 26 symbols and use several different symbols to represent the same letter. The goal is to even out the distribution.
• Ex. Use the letters plus the digits.
  – 36 symbols
  – Assign five symbols to the letter E, two to the letter I, three to the letter N, two each to R and S.
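The frequency argument above is what makes monoalphabetic codes easy to break, and it is also what the "several symbols per letter" trick tries to defeat. A rough Python sketch of the attack follows. It assumes the message was encoded with a simple fixed-distance shift and that the most common ciphertext letter stands for E; both function names are hypothetical, and the guess is only as good as the sample is long.

```python
from collections import Counter
import string


def letter_frequencies(text: str):
    """Return (letter, percent) pairs, most frequent first. Non-letters are ignored."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    if not letters:
        raise ValueError("text contains no letters")
    counts = Counter(letters)
    return [(letter, 100.0 * n / len(letters)) for letter, n in counts.most_common()]


def guess_shift(ciphertext: str) -> int:
    """Guess a fixed-distance shift by assuming the top ciphertext letter is E."""
    top_letter, _ = letter_frequencies(ciphertext)[0]
    return (ord(top_letter) - ord("E")) % 26


# "MEET ME HERE" shifted by 3 places; H is by far the most common letter.
sample = "PHHW PH KHUH"
print(letter_frequencies(sample)[:3])
print(guess_shift(sample))
```

With five different symbols standing for E, no single symbol would dominate the count and this simple guess would fail, which is the point of the trick.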
Examples and breaking: http://www.cs.trincoll.edu/~crypto/historical/vigenere.html

More complex
• Vigenere’s table
• Arrange all the letters of the alphabet 26 times, in parallel columns, such that each column begins with a different letter, first A, then B, etc.
• Encode each letter by using a different column for each successive letter of the message.
• How to know which column to use? Use a keyword.

Vigenere Cipher
• Write out the message
• Write the key over the message, repeating as many times as necessary.
• To encrypt, use the ROW corresponding to the key letter and find the intersection with the COLUMN of the plaintext letter.
• Reverse to decrypt: in the COLUMN of the key letter, scroll down to the ciphertext letter. The label of that ROW is the plaintext letter.
• Question -- how long should the keyword be? Long is hard to remember, short repeats too often.

Recall: Spot Check
• Make up a key
• Encode a plain text message (not more than 20 characters, but at least 10)
• Pass the key and the encoded message to another team.
• Decode the message you receive.

How secure?
• The Vigenere cipher looks really hard, but is not secure. Since the keyword repeats, it is really just a bunch of monoalphabetic codes. If you can figure out the length of the keyword, you can do standard frequency analysis on each one.
• (It was considered unbreakable for nearly 300 years)
• Making it harder - instead of a regular arrangement of the letter columns, scramble them in some arbitrary way.
  – Makes decoding much more difficult, but also makes it difficult to have the arrangement known to the people who are supposed to be able to read the message.

Enigma
• Suppose we take a conversion for the first letter of the message and a different mapping for the next letter and a different mapping for the next letter …
• That is what we did with Vigenere
• Add additional encodings. Rotate from a fixed starting point through 26 positions of the first set of columns, then iterate a second set of columns.
Now have 676 different mappings.
• To decode, must figure out the wiring inside each phase, and the order in which they are arranged in the machine.

Enigma - 2

Encryption/Decryption Keys
• Problem is that you have to get the key to the receiver, secretly and accurately.
• If you can get the key there, why not use the same method to send the whole message? (Efficiency of scale)
• If the key is compromised without the communicators knowing it, the transmissions are open.
• Exact working of the Enigma machine:
  – http://www.codesandciphers.org.uk/enigma/example1.htm
  – How Polish mathematicians broke the Enigma
  – http://www.codesandciphers.org.uk/virtualbp/poles/poles.htm

Summary of encryption goals
• High level of data protection
• Simple to understand
• Complex enough to deter intruders
• Protection based on the key, not the algorithm
• Economical to implement
• Adaptable for various applications
• Available at reasonable cost

Data Encryption Standard
• Complex sequence of transformations
  – hardware implementations speed performance
  – modifications have made it very secure
• Known algorithm
  – security based on difficulty in discovering the key
• http://www.itl.nist.gov/fipspubs/fip46-2.htm

The Data Encryption Standard Illustrated
64-bit blocks, 64-bit key (56 bits used by the algorithm, 8 parity bits)
Federal Information Processing Standards Publication 46-2
http://www.itl.nist.gov/fipspubs/fip46-2.htm

INTERNET-LINKED COMPUTERS CHALLENGE DATA ENCRYPTION STANDARD
LOVELAND, COLORADO (June 18, 1997). Tens of thousands of computers, all across the U.S. and Canada, linked together via the Internet in an unprecedented cooperative supercomputing effort to decrypt a message encoded with the government-endorsed Data Encryption Standard (DES). Responding to a challenge, including a prize of $10,000, offered by RSA Data Security, Inc, the DESCHALL effort successfully decoded RSADSI's secret message.
According to Rocke Verser, a contract programmer and consultant who developed the specialized software in his spare time, "Tens of thousands of computers worked cooperatively on the challenge in what is believed to be one of the largest supercomputing efforts ever undertaken outside of government." Using a technique called "brute-force", computers participating in the challenge simply began trying every possible decryption key. There are over 72 quadrillion keys (72,057,594,037,927,936). At the time the winning key was reported to RSADSI, the DESCHALL effort had searched almost 25% of the total. At its peak over the recent weekend, the DESCHALL effort was testing 7 billion keys per second.

Public Key encryption
• Eliminates the need to deliver a key
• Two keys: one for encoding, one for decoding
• Known algorithm
  – security based on security of the decoding key
  – note, no key delivery problem
• Essential element:
  – knowing the encoding key will not reveal the decoding key

Effective Public Key Encryption
• Encoding method E and decoding method D are inverse functions on message M:
  – D(E(M)) = M
• Computational cost of E, D reasonable
• D cannot be determined from E, the algorithm, or any amount of plaintext attack with any computationally feasible technique
• E cannot be broken without D (only D will accomplish the decoding)
• Any method that meets these criteria is a valid Public Key Encryption technique

It all comes down to this:
• key used for decoding is dependent upon the key used for encoding, but the relationship cannot be determined in any feasible computation or observation of transmitted data

Rivest, Shamir, Adleman (RSA)
• Choose 2 large prime numbers, p and q, each more than 100 digits
• Compute n = p*q and z = (p-1)*(q-1)
• Choose d, relatively prime to z
• Find e, such that e*d = 1 (mod z)
  – or e*d mod z = 1, if you prefer.
• This produces e and d, the two keys that define the E and D methods.
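The key-generation steps above can be traced with toy numbers. The Python sketch below uses tiny primes (p = 7, q = 11) and d = 13 purely for readability; real keys use primes of more than 100 digits, as the slide says. The function name `make_keys` is hypothetical, and the round-trip check assumes the usual textbook form in which encoding raises a block to the power e modulo n and decoding raises it to the power d modulo n. `pow(d, -1, z)` for the modular inverse needs Python 3.8 or later.

```python
from math import gcd


def make_keys(p: int, q: int, d: int):
    """Follow the slide: n = p*q, z = (p-1)*(q-1), d coprime to z, e*d mod z = 1."""
    n = p * q
    z = (p - 1) * (q - 1)
    if gcd(d, z) != 1:
        raise ValueError("d must be relatively prime to z")
    e = pow(d, -1, z)  # modular inverse of d, so that (e * d) % z == 1
    return n, z, e


# Toy values for illustration only; far too small to be secure.
n, z, e = make_keys(p=7, q=11, d=13)
print(n, z, e)

# Round trip on one small block P (must be smaller than n): D(E(P)) == P.
P = 48
C = pow(P, e, n)          # encode with the public pair (e, n)
recovered = pow(C, 13, n)  # decode with the private exponent d = 13
print(C, recovered)
```

Note that the check `D(E(M)) = M` holds for every block value below n, not just the one tried here.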
Public Key encoding
• Convert M into a bit string
• Break the bit string into blocks, P, of size k
  – k is the largest integer such that 2^k < n
  – P corresponds to a binary value: 0 <= P < n
• Encoding method
  – E = Compute C = P^e (mod n)
• Decoding method
  – D = Compute P = C^d (mod n)
• e and n are published (public key)
• d is closely guarded and never needs to be disclosed
What was the problem here?
This version of the algorithm comes from Tanenbaum, Computer Networks.

An example:
• p=7; q=11; n=77; z=60
• d=13; e=37; k=6
• Test message = CAT
• Using A=1, etc. and 5-bit representation:
  – 00011 00001 10100
• Since k=6, regroup the bits (arrange right to left so that any padding needed will put 0's on the left and not change the value):
  – 000000 110000 110100 (three leading zeros added to fill the block)
• Decimal equivalent: 0 48 52
• Each of those raised to the power 37 (e) mod n: 0 27 24
• Each of those values raised to the power 13 (d) mod n (convert back to the original): 0 48 52

A practical note
• There is a lot more to security than encryption.
• Encryption coding is done by a few experts
• Understanding how the common encryption algorithms work is useful in choosing the right approach for your situation.
• Our interest here is in providing assurance that access to protected resources will be limited to those with legitimate rights.

On a practical note: PGP
• You can create your own real public and private keys using PGP (Pretty Good Privacy)
• See the following Web site for full information.
• (MIT site - obsolete)
• http://www.pgpi.org/products/pgp/versions/freeware/
• http://www.freedownloadscenter.com/Utilities/Required_Files/PGP.html

Issues
• Intruder vulnerability
  – If an intruder intercepts a request from A for B’s public key, the intruder can masquerade as B and receive messages from A intended for B. The intruder can send those same or different messages to B, pretending to be A.
  – Prevention requires authentication of the public key to be used.
• Computational expense
  – One approach is to use Public Key Encryption to send the Key for use in DES, then use the faster DES to transmit messages

Digital Signatures
• Some messages do not need to be encrypted, but they do need to be authenticated: reliably associated with the real sender
  – Protect an individual against unauthorized access to resources or misrepresentation of the individual’s intentions
  – Protect the receiver against repudiation of a commitment by the originator

Digital Signature basic technique
• Sender A to Receiver B: intention to send
• Receiver B to Sender A: E(Random Number), where E is A’s public key
• Sender A to Receiver B: Message and D(E(Random Number)) = Random Number, decoded as only A could do

Public key encryption with implied signature
• Add the requirement that E(D(M)) = M
• Sender A has encoding key EA, decoding key DA
• Intended receiver has encoding (public) key EB.
• A produces EB(DA(M))
• Receiver calculates EA(DB(EB(DA(M))))
  – Result is M, but also establishes that only A could have encoded M

Digital Signature Standard (DSS)
• Verifies that the message came from the specified source and also that the message has not been modified
• More complexity than simple encoding of a random number, but less than encrypting the entire message
• Message is not encoded. An authentication code is appended to it.
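The "implied signature" composition above, where A produces EB(DA(M)) and the receiver computes EA(DB(...)), can be tried with toy numbers. The sketch below is not the Digital Signature Standard; it assumes textbook RSA-style exponentiation for E and D, uses two tiny made-up key pairs, and relies on A's modulus being smaller than B's so that DA(M) fits in one of B's blocks. All names are hypothetical.

```python
def apply_key(value: int, exponent: int, modulus: int) -> int:
    """E or D in the textbook form: value raised to exponent, modulo modulus."""
    return pow(value, exponent, modulus)


# Toy key pairs for illustration only (n, public e, private d).
A_N, A_E, A_D = 77, 37, 13     # sender A:   p=7,  q=11, z=60
B_N, B_E, B_D = 143, 103, 7    # receiver B: p=11, q=13, z=120


def send(m: int) -> int:
    """A signs with its private key, then encodes with B's public key: EB(DA(M))."""
    signed = apply_key(m, A_D, A_N)
    return apply_key(signed, B_E, B_N)


def receive(c: int) -> int:
    """B decodes with its private key, then applies A's public key: EA(DB(C))."""
    signed = apply_key(c, B_D, B_N)
    return apply_key(signed, A_E, A_N)


message = 52                      # one block, smaller than A's modulus
transmitted = send(message)
print(receive(transmitted) == message)
```

Because only A holds DA, a message that comes out correctly after applying EA could only have been produced by A, which is the sense in which the signature is implied.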
Digital Signature – SHA (Secure Hash Algorithm) FIPS Pub 186 - Digital Signature Standard http://www.itl.nist.gov/fipspubs/fip186.htm Encryption summary • Problems – intruders can obtain sensitive information – intruder can interfere with correct information exchange • Solution – disguise messages so an intruder will not be able to obtain the contents or replace legitimate messages with others Important methods • DES – fast, reasonably good encryption – key distribution problem • Public Key Encryption – more secure • based on the difficulty of factoring very large numbers – no key distribution problem – computationally intense Digital signatures • Authenticate messages so the sender cannot repudiate the message later • Protect messages from changes during transmission or at the receiver’s site • Useful when the contents do not need encryption, but the contents must be accurate and correctly associated with the sender Recall: Your turn • You receive this: 000000011011011000 – There will not be any convenient spacing between the blocks in the transmitted message. That is not necessary. • It has been encoded with your public key. You have the private key, d = 13 n = 77 • Show the decoding process A practical note • There is a lot more to security than encryption. • Encryption coding is done by a few experts • Understanding how the common encryption algorithms work is useful in choosing the right approach for your situation. • Our interest here is in providing assurance that access to protected resources will be limited to those with legitimate rights. Understanding Quality in a DL • Quality indicators: proposed descriptions of quantities or observable variables that may be related to quality – “measures” = stronger term. Requires validation – Gonçalves et al provide analysis of quality conditions and recommend specific quantities to be used. • Dimensions of quality • Proposed indicators • Application to DL concerns Getting the data • Where does the data come from? 
– Logging – Surveys – Focus Groups • Know what information is needed, then choose the method most likely to provide the data. – More about the sources of data after we see what we need to know. What are we looking for? • What characteristics of a digital library raise questions about quality? – – – – – – Data objects Metadata Collection Catalog Repository Services • What characteristics do we want each of those to have? Dimensions of Quality • Digital Object – – – – – – – Accessibility Pertinence Preservability Relevance Similarity Significance Timeliness • Metadata Specification – Accuracy – Completeness – Conformance • Collection – Completeness • Catalog – Completeness – Consistency • Repository (may hold more than one collection) – Completeness – Consistency • Services – – – – – – Composability Efficiency Effectiveness Extensibility Reusability Reliability Recall: Spot check • For your digital library project, – how will you define quality for each of these factors? • • • • • • Data objects Metadata Collection Catalog Repository Services What is your intention, or your goals, for each of these? I will ask each group to present two of these (briefly), but prepare all of them. Information need - Digital Objects • Accessibility – – – – What collection? 
# of structured streams Rights management metadata Communities to be served • Relevance – Feature frequency – Inverse document frequency – Document size – Document structure – Query size – Collection size • Significance – Citation/link patterns • Preservability – – – – Fidelity (lossiness) Migration cost Digital object complexity Stream formats • Pertinence – Context – Information content – Information need • Similarity – All the same features as in relevance – Also: citation/link patterns • Timeliness – Age – Time of latest citation – Collection freshness Information need - Metadata Specification • Accuracy – Accurate attributes – # attributes in the record • Completeness – Missing attributes – Schema size • Conformance – Conformant attributes – Schema size Information - Collection and Catalog • Completeness of the Collection – Collection size – Size of an “ideal” collection • Completeness of the Catalog – # of digital objects with no metadata • Item level metadata – Size of the collection • Catalog Consistency – # of metadata specifications per digital object Information about the Repository • Completeness – # of collections • Consistency – # of collections – Catalog/collection match • How well do the catalogs match the collections? • Are the catalogs for all the collections at the same level of detail? Service Information Need • Composability (ability to be combined to form new services) – Extensibility – Reusability • Efficiency – Response time • Effectiveness – Precision/recall (of search) – Classification • Extensibility – # extended services – # services in the DL – # lines of code per service manager • Reusability – # reused services – # services in the DL – # lines of code per service manager • Reliability – # service failures – # accesses Making more concrete • Each of the measures listed gives an idea of the information need • Exactly what do we measure? • How do we combine numbers obtained to get a usable result? 
• Following pages describe specific measures and formulas for combining those. Solidifying Pertinence • How do we measure something like pertinence? • Relation between the information content of a digital object and the need of the user • Depends on the user’s situation -background, current context, etc. Preservability • Property of a digital object that describes its state relative to changes in hardware and software, representation format standards – Ex new recording technologies (replacement of VHS video tapes by DVDs) – New versions of software such as Word or Acrobat – New image standards such as JPEG 2000 Digital preservation techniques • Migration Most commonly used – Transform from one format to another • Ex. Open the document in one format and save in another or do an automated transformation • Emulation – Reproducing the effect of the environment originally used to display the material • Keep an old version of the software, or have new software that can read the old format • Wrapping – Keep the original format, but add enough human-readable metadata so that it can be decoded in the future • Note that the material is not directly usable • Refreshing – Copy the stream of bits from one location to another • Particularly suitable for guarding against the physical deterioration of the medium Preservability issues • Obsolescence – How out of date is the digital object? • Many versions of the software? • Old storage media? – Difficult to migrate Miniclip Internet Archive • Appropriate tools? Expertise? • Fidelity – How different is the migrated version from the original? 
– Distortion = loss of information • Preservability of a digital object in a digital library is a function of the fidelity of the migration and the obsolescence of the object • Preservability(doi, dl) = (fidelity of migrating (doi, formatx, formaty), obsolescence(doi, dl)) – Two values to reflect the two dimensions of the concept: fidelity and obsolescence Preservability factors • Capital direct costs – Software • Developing software to create new versions of the object or obtaining licenses for new versions of the original software – Hardware • For processing the migration and for storing the results • Indirect operating costs – – – – Monitoring digital objects for migration needs Maintaining up-to-date intellectual property rights Storage Staff training Significance • Significance is an expression of the absolute usefulness of a given digital object, independent of particular user needs. • Citation records of objects in digital libraries offer one measure of significance. (This disadvantages the most recently obtained objects, since they have had less time to be cited by others.) Look at ACM DL and the citation counts, for example. Life Cycle and Quality • The quality indicators relate to the core components of a digital library – creation, use, finding, distribution. • Creation – Authoring, modifying – Describing, Organizing, Indexing • Use – Access, filtering • Finding (seeking) – Searching, Browsing, recommending • Distribution – Storing – Archiving – Networking Quality and Lifecycle - 2 Quality and Life Cycle - 3 • Note that some elements repeat – Timeliness is relevant to the content and to the metadata that describes the content – Accessibility affects both usefulness and distribution. Digital Library User Interface and Usability Goals: • Discover elements of good interface design for digital libraries of various sorts • Consider examples from DL usability evaluation as sources of insight. 
• Look at the distinct requirements of interfaces to libraries of video and audio files Methods of evaluation • Surveys • Target user groups – Focus groups from the intended audiences – another recommendation: faux focus groups • When it is not practical to do a real focus group for a while, the developers do some role playing, pretend to be users • What do you think of this approach? • Ethnographic studies – Audio/video taped sessions of users – Analysis of feedback and comments • Demographic analysis of beta tester registration data • Log analysis Hill 97 Recall: Your plans • How will you evaluate the usability of your digital library? – What is ideal? – What is practical? – What do you plan to do? Take two or three minutes to think about it and jot notes to yourself. If you are part of a team, do this on your own. (You can compare your team’s responses later.) Then we will hear from each of you. Evaluation • Evaluation for any purpose has two major components – Formative • During development, spot check how things are progressing • Identify problems that may prevent goals from being achieved • Make adjustments to avoid the problems and get the project back on track – Summative • After development, see how well it all came out • Lessons learned may be applicable to future projects, but are too late to affect the current one. • Needed for reporting back to project sponsors on success of the work. Recall: Spot check • Divide into pairs so that each member of the pair is from a different project. • One member of the group looks at the other person’s project. – Try to use it – Give feedback on usability • Switch to the other project and repeat • About 5 - 7 minutes on each project • What kind of evaluation was this? Did this exercise lead to any changes in your project plans? 
Usability evaluation • Lab-based formative evaluation – Real and representative users – Benchmark tasks – Qualitative and quantitative data – Leads to redesign where needed • After deployment – Real users doing real tasks in daily work – Summative with respect to the deployed system – Useful for later versions Sample evaluation • Digital library evaluated by a usability expert, results reported. – See references: Hartson and Perez-Quiñones – NCSTRL (“Networked Computer Science Technical Reference Library”) – Still present at ncstrl.org. We looked at this in some detail and gathered some guidelines. Categories of Problems • General to most applications, GUIs – Wording – Consistency – Graphic layout and organization – User’s model of the system • Digital Library functionality – Browsing – Filtering – Searching – Document submission functions (Hartson 04) Guidelines discovered • Standardize terminology and check it carefully • Clearly indicate where the user is in the overall system • Use terms that are meaningful to users without explanation whenever possible. Resist presenting data that is not useful for user purposes. • Label for the user, not the developer • Label results appropriately, even scrupulously, for their real meaning. • Cosmetic considerations can have a positive effect on the user’s impression of the site. • Organize task interfaces by categories to present a structured system model and reduce cognitive workload. Guidelines, continued • Consider the implications of placement and association of graphical elements. • Any application should have a home page that explains what the site is about and gives the user a sense of the overall site capability and use. • Usability suggestion: combine search, browse, filter into one selection and navigation facility. • Allow user activity that will serve user needs.
Try to find out what users want before making decisions about services offered • Link directly to the service offered without any intermediate pages unless needed in support of the service. 100 Recall: Spot check • Given all those points (in the bold and different color type), pick one that strikes you as especially relevant for your project and say how you will address it. – Do consider all of them when working on your project – just pick one to talk about now. Source: Lee 02 Video Digital Libraries • Video digital libraries offer more challenges for interface design – Information attributes are more complex • Visual, audio, other media – Indicators and controlling widgets • Start, stop, reverse, jump to beginning/end, seek a particular frame or a frame with a specified characteristic Note: Youtube started in 2005. This material was published in 2002. It is of interest to see how youtube compares with the desired characteristics of a video library Source: Lee 02 Summarizing stages of information seeking and the interface elements that support them as described in four researchers’ work. Recall: A scenario • Directory with 100 (or 1000 or…) video files. • No information except the file name. – Maybe reasonable name, but not very descriptive • You want to find a particular clip from a party or a ceremony or some other event. • What are your options? • What would you like to have available? Spend a bit of time now talking about this. 
Source: Lee 02 Video Abstraction • Levels to present: (from Shneiderman 98) – Overview first – Zoom and Filter – Details on Demand • Example levels (from Christel 97) – – – – Title: text format, very high level overview Poster frame: single frame taken from the video Filmstrip: a set of frames taken from the video Skim: multiple significant bits of video sequences • Time reference – Significant in video – Options include simple timeline, text specification of time of the current frame, depth of browsing unit Source: Lee 02 Keyframe browsing • Extract a set of frames from the video – Display each as a still image – Link each to play the video from that point • Selection is not random – Video analysis allows recognition • Sudden change of camera shot • Scenes with motion or largely stationary – Video indexing based on frame-by-frame image comparison • Similar to thumbnail browsing of image collections Keyframe extraction for display on browsing interface Source: Lee 02 Source: Lee 02 Keyframe extraction • Manual – Owner or editor explicitly selects the frames to be used as index elements • Automatic – Subsampling - select from regular intervals • Easy, but may not be the best representation – Automatic segmentation - break the video into meaningful chunks and sample each • Shot boundary detection - note switch from one camera to another, or distinct events from one camera Metadata Harvesting Interoperable digital collections Distributed libraries • The reality in most digital libraries is that no one location has all the materials that may be of interest. • It is often more efficient to allow a number of sites each to retain some of the materials. • How can we assure clients that they will see all relevant resources, regardless of which library they search? What I was doing this week • Discussion and summary of the NSDL program experience. – Questions about the appropriateness of the library metaphor – These apply to any digital library. Is that the right term? 
• Interoperability relates to inter-library loan, common catalog availability, branch libraries, community-specific libraries, etc. Two basic approaches • One service provider with access to resources stored in multiple locations – Information about all the resources is located at the service provider. – Services (DL scenarios) use the information to provide connections to resources at multiple locations • Distributed services – Information kept with the resources – Services, local to each collection, interact with other collection sites Distributed Resources Multiple Services • Approach 1: One service provider gathers information about data and uses it to provide services. [Figure: data providers; service provider (search, browse, compare, etc.)] Distributed data and services • Approach 2: Each system is both a data repository and a service provider. Services query other data providers as needed. [Figure: systems offering search, browse, compare] Hybrid systems • Each server is likely to have its own clients. The difference is whether the information exchange is periodic or ad hoc. [Figure: data providers; data/service providers; service provider (search, browse, compare, etc.)] Open Archives Initiative (OAI) • Web-based – Uses HTTP to communicate between sites • Centralized server – Services provided from a site that has already gathered the information it needs for those services from a distributed collection of sites. http://www.openarchives.org/pmh/ OAI PMH • Interoperability through Metadata Exchange • The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.
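A harvesting request of the kind described above can be sketched as a plain HTTP query string. This is only an illustration: the six verb names are those of the OAI-PMH specification, but the repository address `https://repository.example.org/oai` and the helper `build_oai_request` are hypothetical names introduced here, and no request is actually sent.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs (services) a Service Provider can invoke over HTTP.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_oai_request(base_url, verb, **arguments):
    """Return the URL a harvester would fetch for one OAI-PMH request.

    Hypothetical helper for illustration; it only builds the URL.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb!r}")
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Assumed repository address, not a real Data Provider.
BASE_URL = "https://repository.example.org/oai"

# Ask the repository for all records in Dublin Core (prefix "oai_dc").
harvest_url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(harvest_url)
```

The Data Provider would answer such a request with an XML document; parsing that response is left out of this sketch.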
OAI - ORE http://www.openarchives.org/ore/ • Aggregations of Web Resources • Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards for the description and exchange of aggregations of Web resources. These aggregations, sometimes called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video. The goal of these standards is to expose the rich content in these aggregations to applications that support authoring, deposit, exchange, visualization, reuse, and preservation. Although a motivating use case for the work is the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, the intent of the effort is to develop standards that generalize across all web-based information including the increasingly popular social networks of “web 2.0”. http://www.openarchives.org/ore/1.0/primer.html#Example OAI - ORE • ORE allows aggregation of related web pages to form a logical unit – The representation allows access to all of the components of a resource at once. Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH • OAI-PMH defines an interface between the Harvester and any number of Repositories • Implemented as CGI, ASP, PHP, or other • Any system may serve as a harvester, repository, or both [Figure: the Harvester (Service Provider) sends an HTTP request (OAI verb) to the Repository (Metadata Provider), which returns an HTTP response (XML)] OAI - PMH components • Service Providers and Data Providers • Requests and Responses http://www.oaforum.org/tutorial/english/page3.htm#section3 Records • Metadata of a resource.
• Three parts – Header (required) • Identifier (required: 1 only) • Datestamp (required: 1 only) • setSpec elements (optional: 0, 1, or more) • Status attribute for deleted item – Metadata (required) • XML encoded metadata with root tag, namespace • Repositories must support Dublin Core, other formats optional – “About” statement (optional) • Rights statements • Provenance statements Interoperability • The goal: communication, without human intervention, between information sources – Books that “talk to each other” • Live links for references • Knowledge of how to find relevant resources when needed • Ability to query other information locations Protocols • Precise rules for interactions between independent processes – Format of the messages • Both structure and content – Specified behavior in response to specific messages • Many ways to accomplish the same result, but both sides must have the same understanding of the rules of engagement. Information Retrieval • This was the most recent part of the class, so will not occupy a prime position in this review • Digital libraries may be considered a subdiscipline of IR Information Retrieval • The basic problem – Given a collection of materials, devise methods for efficiently finding what is wanted. • Issues: – Precision • Getting only materials that match the search need – Recall • Getting all the materials that match the search need – These are usually contrary requirements Representing the collection • Efficient access to content depends on the information available about the materials • Indexing – Producing pointers to materials • pointers must be accurate and complete • pointers must be searchable, efficiently One more time: the scale of the problem • Soon most everything will be recorded and indexed • Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies • These require algorithms, data and knowledge representation, and knowledge of the domain • See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ [Figure: scale of recorded information running from Kilo through Mega, Giga, Tera, Peta, Exa, and Zetta to Yotta, with markers for A Book, A Photo, A movie, All books (words), All Books MultiMedia, and Everything Recorded; small prefixes listed for comparison: 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto] Slide source: Jim Gray, Microsoft Research (modified) Inverted index construction • Documents to be indexed: “Friends, Romans, countrymen.” • Tokenizer produces the token stream: Friends, Romans, Countrymen • Linguistic modules (stop words, stemming, capitalization, cases, etc.) produce the modified tokens: friend, roman, countryman • Indexer produces the inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16 Scaling • These basic techniques are pretty simple • There are challenges – Scaling • as everything becomes digitized, how well do the processes scale? – Intelligent information extraction • I want information, not just a link to a place that might have that information. Ranked retrieval models • Rather than a set of documents satisfying a query expression, in ranked retrieval models, the system returns an ordering over the (top) documents in the collection with respect to a query • Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language • In principle, these are different options, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa Scoring as the basis of ranked retrieval • We wish to return, in order, the documents most likely to be useful to the searcher • How can we rank-order the documents in the collection with respect to a query?
• Assign a score – say in [0, 1] – to each document • This score measures how well document and query “match”. Term frequency - tf • The term frequency $\mathrm{tf}_{t,d}$ of term t in document d is defined as the number of times that t occurs in d. • We want to use tf when computing query-document match scores. But how? • Raw term frequency is not what we want: – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. – But not 10 times more relevant. • Relevance does not increase proportionally with term frequency. NB: frequency = count in IR Log-frequency weighting • The log frequency weight of term t in d is $w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$ • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. • Score for a document-query pair: sum over terms t in both q and d: $\mathrm{score}(q,d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$ • The score is 0 if none of the query terms is present in the document. idf weight • $\mathrm{df}_t$ is the document frequency of t: the number of documents that contain t – $\mathrm{df}_t$ is an inverse measure of the informativeness of t – $\mathrm{df}_t \le N$ (the number of documents) • We define the idf (inverse document frequency) of t by $\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$ – We use $\log(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to “dampen” the effect of idf. It will turn out that the base of the log is immaterial. tf-idf weighting • The tf-idf weight of a term is the product of its tf weight and its idf weight: $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$ • Best known weighting scheme in information retrieval – Note: the “-” in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf x idf • Increases with the number of occurrences within a document • Increases with the rarity of the term in the collection Final ranking of documents for a query • $\mathrm{Score}(q,d) = \sum_{t \in q \cap d} w_{t,d}$, where $w_{t,d}$ is the tf-idf weight of t in d Documents as vectors • So we have a |V|-dimensional vector space • Terms are axes of the space • Documents are points or vectors in this space • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine • These are very sparse vectors - most entries are zero. From angles to cosines • The following two notions are equivalent. – Rank documents in increasing order of the angle between query and document (smallest angle first) – Rank documents in decreasing order of cosine(query, document) (largest cosine first) • Cosine is a monotonically decreasing function on the interval [0°, 180°]. Therefore, a small angle results in a large cosine and a large angle results in a small cosine – just the behavior we need. Length normalization • A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$ • Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere) • Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization. – Long and short documents now have comparable weights Cosine for length-normalized vectors • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): $\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i$ for q, d length-normalized. Cosine similarity illustrated • This is for vectors representing only two words. Could you draw the vectors for your news examples? tf-idf weighting has many variants • Columns headed ‘n’ are acronyms for weight schemes. • Why is the base of the log in idf immaterial?
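The pieces covered so far (log-frequency tf, idf, length normalization, and cosine as a dot product) can be put together in a few lines. The following is a minimal sketch on a made-up three-document collection; the function names and the toy documents are assumptions for illustration, and real systems use one of the many weighting variants rather than exactly this one.

```python
import math
from collections import Counter


def log_tf(count):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1 + math.log10(count) if count > 0 else 0.0


def tf_idf_vectors(docs):
    """docs is a list of token lists; returns sparse tf-idf vectors and the idf table."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log10(n_docs / df_t) for term, df_t in df.items()}
    vectors = [
        {term: log_tf(count) * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def normalize(vec):
    """Divide every component by the L2 norm so the vector has unit length."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()} if length > 0 else {}


def cosine(q, d):
    """Cosine of two length-normalized sparse vectors is just their dot product."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def rank(query_tokens, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection carry no weight.
    q_counts = Counter(t for t in query_tokens if t in idf)
    q_vec = normalize({t: log_tf(c) * idf[t] for t, c in q_counts.items()})
    scores = [(cosine(q_vec, normalize(v)), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Toy collection, already tokenized and normalized (an assumption of this sketch).
docs = [
    ["friend", "roman", "countryman"],
    ["roman", "roman", "empire"],
    ["digital", "library", "metadata"],
]
ranking = rank(["roman", "empire"], docs)
print([index for _, index in ranking])
```

Because the vectors are kept as dictionaries of non-zero entries, the sparsity noted above costs nothing: terms absent from a document simply do not appear in its vector.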
Summary – vector space ranking • Represent the query as a weighted tf-idf vector • Represent each document as a weighted tf-idf vector • Compute the cosine similarity score for the query vector and each document vector • Rank documents with respect to the query by score • Return the top K (e.g., K = 10) to the user Reality • The search engines use a variation of the tf-idf ranking. If they used a published algorithm, anyone could fool the search engine ranking process and results would not be good. • At Google, every bit of code written must be read by someone else. Everyone sees other code. Only the ranking code is secret. 145 Web Crawling First, What is Crawling A web crawler (aka a spider or a robot) is a program – Starts with one or more URL – the seed • Other URLs will be found in the pages pointed to by the seed URLs. They will be the starting point for further crawling – Uses the standard protocols for requesting a resource from a server • Requirements for respecting server policies • Politeness – Parses the resource obtained • Obtains additional URLs from the fetched page – Implements policies about duplicate content – Recognizes and eliminates duplicate or unwanted URLs – Adds found URLs to the queue and continues from the request to server step Crawler features • A crawler must be – Robust: Survive spider traps. Websites that fool a spider into fetching large or limitless numbers of pages within the domain. • Some deliberate; some errors in site design – Polite: Crawlers can interfere with the normal operation of a web site. Servers have policies, both implicit and explicit, about the allowed frequency of visits by crawlers. Responsible crawlers obey these. Others become recognized and rejected outright. 
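The crawl loop described above (seed URLs, fetch, parse, eliminate duplicates, add new URLs to the queue) can be sketched as follows. This is a simplified illustration, not a production crawler: the `fetch` function is passed in so the example can run against an in-memory toy web (`TOY_WEB`, a made-up dictionary), and robots.txt handling and per-server politeness delays are deliberately left out. The `max_pages` cap is a crude guard against spider traps.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns the page text or None."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    seen_urls = set(seeds)       # duplicate URL elimination
    seen_content = set()         # naive duplicate content check
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None or page in seen_content:
            continue
        seen_content.add(page)
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen_urls:
                seen_urls.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
TOY_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.org/b": '<a href="/a">A again</a>',
}
pages = crawl(["http://example.org/"], TOY_WEB.get)
print(pages)
```

Swapping `TOY_WEB.get` for a function that performs real HTTP requests would be the point at which server policies and politeness have to be added.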
Ref: Manning Introduction to Information Retrieval Crawler features • A crawler should be – Distributed: able to execute on multiple systems – Scalable: The architecture should allow additional machines to be added as needed – Efficient: Performance is a significant issue if crawling a large web – Useful: Quality standards should determine which pages to fetch – Fresh: Keep the results up-to-date by crawling pages repeatedly in some organized schedule – Extensible: Modular, well-crafted architecture allows the crawler to expand to handle new formats, protocols, etc. Ref: Manning Introduction to Information Retrieval Robots.txt • Protocol nearly as old as the web • See www.robotstxt.org/robotstxt.html • File: URL/robots.txt • Contains the access restrictions – Example: • All robots (spiders/crawlers): “User-agent: *” followed by “Disallow: /yoursite/temp/” • Robot named searchengine only: “User-agent: searchengine” followed by “Disallow:” (nothing disallowed) Source: www.robotstxt.org/wc/norobots.html Architecture of a Search Engine Ref: Manning Introduction to Information Retrieval Basic Crawl Architecture [Figure: URL Frontier → Fetch (with DNS and WWW) → Parse → Content seen? (Doc FP’s) → URL filter (robots filters) → Dup URL elim (URL set) → back to URL Frontier] Ref: Manning Introduction to Information Retrieval Summary • We looked at a lot of topics, all of which are related to digital libraries. • The library metaphor for this class of web-based information system works if we understand all that a library is. • Translating the library model into a web-based information system touches on a lot of information handling issues.