Recap of the semester - Villanova Department of Computing Sciences

Digital Libraries
Review and Reflection
Agenda
• Review the course syllabus and see how we
did
• Review the key points of the semester
• Reconsider some of the spot checks and
thinking points
• Update project status, prepare for next week’s
presentations
What is a library?
• Our concept map exercise.
– Do we have a better understanding now about
what a modern library is?
– What are the essential characteristics?
• The 5S model for describing the components of a DL
• Streams, spaces, structures, societies, scenarios
The Etana digital library of archeological artifacts
[Figure: the 5S framework applied to the ETANA digital library. Society model: archaeologists, the general public. Scenario model: services (value added, domain specific, service manager, information satisfaction, repository building). Space model: geographic space, metric space, user interface. Structure model: metadata, taxonomies (spatial, temporal, artifact-specific), and the site hierarchy Region > Site > Partition > Sub-partition > Locus > Container > Artifact. Stream model: text, audio, video, drawing, photo, 3D.]
In the beginning
• Vannevar Bush’s vision
– How far have we come?
– What did you notice about this article -- style, content, background, or anything else?
– Did the article suggest anything you would not want to see
happen?
Applying the model, informally
Looking back on your project plan
• Stream - what types of data? gif, jpg, avi, docx, pdf, html?
• Structure - How are the elements organized? Is there a
hierarchy? Are there multiple structures?
• Spaces - How will we index the items? How will we divide them into related groups?
• Scenarios - what services will we provide? What information do
we need to provide those services? What events might happen
that we need to plan for?
• Societies - who is the library intended to serve? Remember to
include agents and other processes as well as users.
Describing the content
• How to describe content
– Metadata
• Machine readable description of anything
• What description
– Machine readable requires standard descriptive
elements
• Dublin Core (http://dublincore.org/)
– International standard
– “a standard for cross-domain information resource description.”
– 15 descriptive elements
• Other metadata schemes
– IEEE-LOM
XML
• XML is a markup language
• XML describes features
• There is no standard XML
• Use XML to create a resource type
• Separately develop software to interact with the data described by the XML codes.
Source: tutorial at w3schools.com
Elements and attributes
• Use elements to describe data
• Use attributes to present information
that is not part of the data
–For example, the file type or some
other information that would be useful
in processing the data, but is not part
of the data.
Parts of an XML document
• Elements
– The components of an XML document
– Some contain other parts, some are empty
• Ex in HTML: “br” or “table” in XML “ingredient”
• Attributes
– Information about elements, not data
• Ex in HTML “src=”, in XML “scale=”
(The HTML examples are familiar; the XML examples are made up – dependent on the specific XML scheme used.)
• Entities
– Special characters or strings with pre-assigned meaning
• Ex in HTML: &nbsp; for non-breaking space
• PCDATA
– Parsed Character data: text that will be parsed and interpreted by the reader.
Tags and entities will be expanded and used in presentation.
• CDATA
– Character data: text that will not be parsed and interpreted. It will be displayed
exactly as provided.
Using XML - an example
Define the fields of a recipe collection:
<?xml version="1.0" encoding="ISO-8859-1"?>
<recipe>
  <recipe-title> </recipe-title>
  <ingredient-list>
    <ingredient>
      <ingredient-amount> </ingredient-amount>
      <ingredient-name> </ingredient-name>
    </ingredient>
  </ingredient-list>
  <directions> </directions>
</recipe>
Note: ISO 8859 is a character set. See http://www.bbsinc.com/iso8859.html
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe SYSTEM “recipe.dtd”>
External reference to DTD
<recipe>
<recipe-title> Meringue cookies</recipe-title>
Not the way that I
<ingredient-list>
want to see a recipe in
<ingredient>
a magazine!
<ingredient-amount>3 </ingredient-amount>
<ingredient-name> egg whites</ingredient-name>
What could we
</ingredient> <ingredient>
do with a large
<ingredient-amount> 1 cup</ingredient-amount>
collection of
<ingredient-name> sugar</ingredient-name>
such entries?
</ingredient> <ingredient>
<ingredient-amount>1 teaspoon </ingredient-amount>
<ingredient-name> vanilla</ingredient-name>
How would we
</ingredient> <ingredient>
get the
<ingredient-amount>2 cups </ingredient-amount>
information
<ingredient-name>mini chocolate chips </ingredient-name>
entered into a
</ingredient>
collection?
</ingredient-list>
<directions>Beat the egg whites until stiff. Stir in sugar, then vanilla. Gently fold in chocolate chips. Place
in warm oven at 200 degrees for an hour. Alternatively, place in an oven at 350 degrees. Turn oven off
and leave overnight.
</directions>
</recipe>
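One answer to what we could do with a large collection of such entries: once the fields are tagged, programs can pull them out. A minimal sketch in Python, assuming the files follow the recipe structure above (the directory path and function name are made up for illustration):

import glob
import xml.etree.ElementTree as ET

def uses_ingredient(path, wanted):
    # Parse one recipe file and test whether any ingredient name matches.
    root = ET.parse(path).getroot()   # the <recipe> element
    names = [n.text.strip().lower()
             for n in root.iter("ingredient-name") if n.text]
    return any(wanted in name for name in names)

# Find every recipe in a hypothetical collection that uses egg whites.
matches = [p for p in glob.glob("recipes/*.xml")
           if uses_ingredient(p, "egg whites")]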
Vocabulary
• Given the need for processing, do you want free text
or restricted entries?
• Free text gives more flexibility for the person making
the entry
• Controlled vocabulary helps with
– Consistent processing
– Comparison between entries
• Controlled vocabulary limits
– Options for what is said
Dublin Core elements
see: http://dublincore.org/documents/dces/
• Title
• Creator - Entity primarily responsible for making the content of the resource
• Subject - C
• Description
• Publisher - Entity making the resource available
• Contributor - Contributor to the content of the resource
• Date - ex. YYYY-MM-DD
• Type - C. Ex: collection, dataset, event, image
• Format - C. What is needed to display or operate the resource
• Identifier - Unambiguous ID
• Source
• Language - Standards: RFC 3066, ISO 639
• Relation - Ref. to related resource
• Coverage - C. Space, time, jurisdiction
• Rights - Rights management information
C = controlled vocabulary recommended.
Dublin Core Terms
• An update to the original DC elements
– Adds the concept of range and domain
Each term has this minimal set of attributes:
• Name: A token appended to the URI of a DCMI namespace to
create the URI of the term.
• Label: The human-readable label assigned to the term.
• URI: The Uniform Resource Identifier used to uniquely
identify a term.
• Definition: A statement that represents the concept and
essential nature of the term.
• Type of Term: The type of term as described in the DCMI
Abstract Model [DCAM].
DC Terms
Additional attributes possible:
• Comment: Additional information about the term or its application.
• See: Authoritative documentation related to the term.
• References: A resource referenced in the Definition or Comment.
• Refines: A Property of which the described term is a Sub-Property.
• Broader Than: A Class of which the described term is a Super-Class.
• Narrower Than: A Class of which the described term is a Sub-Class.
• Has Domain: A Class of which a resource described by the term is an Instance.
• Has Range: A Class of which a value described by the term is an Instance.
• Member Of: An enumerated set of resources (Vocabulary Encoding Scheme) of which the term is a Member.
• Instance Of: A Class of which the described term is an instance.
• Version: A specific historical description of a term.
• Equivalent Property: A Property to which the described term is equivalent.
The DC Terms – from 15 to …
abstract, accessRights, accrualMethod, accrualPeriodicity, accrualPolicy,
alternative, audience, available, bibliographicCitation, conformsTo,
contributor, coverage, created, creator, date, dateAccepted,
dateCopyrighted, dateSubmitted, description, educationLevel, extent,
format, hasFormat, hasPart, hasVersion, identifier, instructionalMethod,
isFormatOf, isPartOf, isReferencedBy, isReplacedBy, isRequiredBy, issued,
isVersionOf, language, license, mediator, medium, modified,
provenance, publisher, references, relation, replaces, requires, rights,
rightsHolder, source, spatial, subject, tableOfContents, temporal, title,
type, valid
Source: www.cs.cornell.edu/courses/cs502/2002sp/.../lecture%202-26-02.ppt
Using Dublin Core
• Dublin core fields are attached to a resource
– separate metadata file associated with the main item
– as META tags within the html
Example: <META name="DC.Title"
  content="Novi Belgii Novæque Angliæ: nec non partis Virginiæ tabula multis in locis emendata"
  lang="la">
Source: http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter7.html
Framework for Access Management
Legal and Technical Issues
• Legal: When is a resource available to digitize and
make available? What requirements exist for
controlling access?
• Technical: How do we control access to a resource
that is stored online?
– Policies
– Encoding
– Distribution limitations
source: http://www.unc.edu/~unclng/public-d.htm
Public Domain
• Definition:
A public domain work is a creative work that
is not protected by copyright and which may be freely used by
everyone. The reasons that the work is not protected include:
– (1) the term of copyright for the work has expired;
– (2) the author failed to satisfy statutory formalities to perfect the
copyright or
– (3) the work is a work of the U.S. Government.
Even in the case of public domain, the origin of the work should be noted if
possible.
Date of work | Protected from | Term
Created 1-1-78 or after | When work is fixed in tangible medium of expression | Life + 70 years (or if work of corporate authorship, the shorter of 95 years from publication, or 120 years from creation)
Published before 1923 | In public domain | None
Published 1923 - 63 | When published with notice | 28 years + could be renewed for 47 years, now extended by 20 years for a total renewal of 67 years. If not so renewed, now in public domain
Published from 1964 - 77 | When published with notice | 28 years for first term; now automatic extension of 67 years for second term
Created before 1-1-78 but not published | 1-1-78, the effective date of the 1976 Act which eliminated common law copyright | Life + 70 years or 12-31-2002, whichever is greater
Created before 1-1-78 but published between then and 12-31-2002 | 1-1-78, the effective date of the 1976 Act which eliminated common law copyright | Life + 70 years or 12-31-2047, whichever is greater
Chart created by Lolly Gasaway. Updates at http://www.unc.edu/~unclng/public-d.htm
Fair use
• No clear, easy answers.
• Checklist provided in the article is a good
guide to the issues.
• Link to the checklist:
http://www.nyu.edu/its/humanities/ninchguide/IV/
– Search for checklist
Moral rights
• Fair to the creator
– Keep the identity of the creator of the work
– Do not cut the work
– Generally, be considerate of the person (or
institution) that created the work.
Getting Permission
• With the best will in the world, getting the appropriate
permissions is not always easy.
– Identify who holds the rights
– Get in touch with the rights holder
– Get a suitable agreement to cover the needs of your use.
• Useful links:
http://www.loc.gov/copyright/
http://www.copylaw.com/new_articles/permission.html
http://fairuse.stanford.edu/Copyright_and_Fair_Use_Overview/chapter1/1b.html
http://www.k-state.edu/academicpersonnel/intprop/permission.htm
- Includes sample letters requesting permission
Checking copyright status
Source: NINCH Guide to Good
Practice. Chapter 4: Rights
Management
Considering
people depicted
in the work
Source: NINCH Guide
to Good Practice.
Chapter 4: Rights
Management
Copyright: Lauryn G.
Grant
Recall: Spot check
Part 1: 5-7 minutes
• Working in groups of two or three, construct a scenario of when a work might be used.
– Put it in your digital library?
– Quote it in a paper?
– Use it for an assignment?
– Use it to bolster an argument?
• Be specific about the exact nature of the work. Image? Text? When created, who created it, etc.
Spot check
Part 2 – Rights management
10 – 15 minutes
• Pass your scenario to another group, and
receive one in turn.
• Make a decision about the rights management
issues related to the work you received.
• Write a brief summary of the issues involved
and what needs to be done
Technical issues
• Rights management is about policy and technical issues.
We looked at policy until now. Now for the technical
issues.
• Link the resource to the copyright statements
• Maintain that link when the resource is copied or used
• Approaches:
– Steganography
– Encryption
– Digital Wrappers
– Digital Watermarks
Issues in Encryption
• General cases for protection of controlled content:
Concern for passive listening, active interference.
– Listening: intruder gains information, may not be detected.
Effects indirect.
– Active interference
• Intruder may prevent delivery of the message to the intended
recipient.
• Intruder may substitute a fake message for the intended one
• Effects are direct and immediate
• Less likely in the case of digital library content
Message interception
[Figure: an original message (plain text) passes through an encoding method to become ciphertext, then through a decoding method to become the received message (plain text). An intruder can attack the ciphertext in transit by eavesdropping or by masquerading as a legitimate party.]
Types of Encryption Methods
• Substitution
– Simple adjustment, Caesar’s cipher
• Each letter is replaced by one that is a fixed distance from it in the alphabet. A
becomes D, B becomes E, etc. At the end, wrap around, so X becomes A, Y becomes
B, Z becomes C.
• May have been confusing the first time it was done, but it would not have taken long to figure it out.
• Note the simple example at geocaching.com: No intention to hide or confuse. Just
keep a person from seeing too much information about the hide, unless the person
wants to see the help.
– Simple substitution of other characters for letters -- numbers, dancing men,
etc.
– More complex substitution. No pattern to the replacement scheme.
• See common cryptogram puzzles. These are usually made easier by showing the
spaces between the words. (For very modern version, see
http://www.cryptograms.org/)
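A minimal sketch of the Caesar cipher just described, in Python; the classical shift of 3 (A becomes D, wrapping at the end of the alphabet) is passed here as a parameter:

import string

def caesar(text, shift=3):
    # Replace each letter by the one a fixed distance later, wrapping Z -> A.
    table = str.maketrans(
        string.ascii_uppercase,
        string.ascii_uppercase[shift:] + string.ascii_uppercase[:shift])
    return text.upper().translate(table)

print(caesar("ATTACK AT DAWN"))           # DWWDFN DW GDZQ
print(caesar("DWWDFN DW GDZQ", shift=23)) # decode: shift the other way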
Dancing Men????
• Arthur Conan Doyle: The Adventure of the
Dancing Men. A Sherlock Holmes Adventure.
“Speaking roughly, T, A, O, I, N, S, H, R, D, and L are the numerical order in
which letters occur; but T, A, O, and I are very nearly abreast of each other,
and it would be an endless task to try each combination until a meaning was
arrived at.”
Read the story online and see the images and analysis of the decoding at
http://camdenhouse.ignisart.com/canon/danc.htm
Types of encryption - 2
Hiding the text.
• The wax tablet example
– message written on the base of the tablet and wax put over top of it with another message on the wax
• Steganography: (ste-g&n-o´gr&-fē) (n.) The art and science of hiding information by embedding messages within other, seemingly harmless messages. Steganography works by replacing bits of useless or unused data in regular computer files (such as graphics, sound, text, HTML, or even floppy disks) with bits of different, invisible information. This hidden information can be plain text, cipher text, or even images. (Definition from www.webopedia.com)
• Special software is needed for steganography, and there are freeware versions available at any good download site.
• Can be used to insert identification into a file to track its source.
Types of encryption - 3
• Key-based shuffling
– Using a mnemonic to make the key easy to remember.
• A machine to do the shuffling
[Figure: a plugboard-style wiring diagram mapping input letters A, B, C, D to shuffled output letters. What shuffling is used? How would “CAB” look?]
Monoalphabetic codes
• Any kind of substitution in which just one letter (or
other symbol) represents one letter from the original
alphabet is called monoalphabetic encoding.
– Such codes are easy to break. That is what you do when
you solve cryptograms.
– Frequency distributions of letters in normal text for a given language are well known.
• “The twelve most frequently-used letters in the English language
are ETAOIN SHRDL, in that order.” -- http://www.cryptograms.org/
Letter distributions in English
Letter %  | Letter %  | Digram %  | Digram %  | Word %
A 7.81    | N 7.28    | TH 3.18   | OU 0.72   | THE 6.42
B 1.28    | O 8.21    | IN 1.54   | IT 0.71   | OF 4.02
C 2.93    | P 2.15    | ER 1.30   | ES 0.69   | AND 3.15
D 4.11    | Q 0.14    | RE 1.30   | ST 0.68   | TO 2.36
E 13.05   | R 6.64    | AN 1.08   | OR 0.68   | A 2.09
F 2.88    | S 6.46    | HE 1.08   | NT 0.67   | IN 1.77
G 1.39    | T 9.02    | AR 1.02   | HI 0.68   | THAT 1.25
H 5.85    | U 2.77    | EN 1.02   | EA 0.64   | IS 1.03
I 6.77    | V 1.00    | TI 1.02   | VE 0.64   | I 0.94
J 0.23    | W 1.49    | TE 0.98   | CO 0.59   | IT 0.93
K 0.42    | X 0.30    | AT 0.88   | DE 0.55   | FOR 0.77
L 3.60    | Y 1.51    | ON 0.84   | RA 0.55   | AS 0.76
M 2.62    | Z 0.09    | HA 0.84   | RO 0.55   | WITH 0.76
SOURCE: Tannenbaum Computer Networks 1981 Prentice Hall
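Frequencies like those in the table are easy to recompute from any text sample, which is the first step in breaking a monoalphabetic code. A small Python sketch (the file name is hypothetical; digram counts here cross word boundaries, so they will differ slightly from the table):

from collections import Counter

def frequencies(text):
    # Tally single-letter and digram percentages, most common first.
    letters = [c for c in text.upper() if c.isalpha()]
    singles = Counter(letters)
    digrams = Counter(a + b for a, b in zip(letters, letters[1:]))
    n = len(letters)
    return ([(c, 100.0 * k / n) for c, k in singles.most_common()],
            [(d, 100.0 * k / (n - 1)) for d, k in digrams.most_common(10)])

singles, digrams = frequencies(open("sample.txt").read())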
Recall: Spot Check
• Go to the cryptogram site
(www.cryptograms.org) and solve a puzzle.
• Work in groups of two or three
• What information is helpful?
• What makes a puzzle hard?
• Suppose there were no spaces between the
words? Then what would you do?
Disguising frequencies
• First trick: use more than 26 symbols and use
several different symbols to represent the
same letter. The goal is to even out the
distribution.
• Ex. Use the letters plus the digits.
– 36 symbols
– Assign five symbols to the letter E, two to the
letter I, three to the letter N, two each to R and S.
Examples and breaking:
http://www.cs.trincoll.edu/~crypto/historical/vigenere.html
More complex
• Vigenere’s table
• Arrange all the letters of the alphabet 26 times, in
parallel columns, such that each column begins with
a different letter, first A, then B, etc.
• Encode each letter by using a different column for
each successive letter of the message.
• How to know which column to use? Use a keyword.
Vigenere Cypher
• Write out the message.
• Write the key over the message, repeating as many times as necessary.
• To encrypt, use the ROW corresponding to the key letter and find the intersection with the COLUMN of the plaintext letter.
• Reverse to decrypt (use the COLUMN of the key and scroll down to the row indicated by the cyphertext; the intersection shows the plaintext).
• Question -- how long should the keyword be? Long is hard to remember, short repeats too often.
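Because the tabula recta is just addition mod 26, the row/column lookup above reduces to index arithmetic. A minimal Python sketch (it assumes the message has already been stripped to the letters A-Z; the key LEMON is a stand-in):

def vigenere(text, key, decrypt=False):
    # Row = key letter, column = plaintext letter: addition mod 26.
    out = []
    for i, c in enumerate(text.upper()):
        k = ord(key.upper()[i % len(key)]) - ord('A')
        if decrypt:
            k = -k                      # subtraction reverses the table lookup
        out.append(chr((ord(c) - ord('A') + k) % 26 + ord('A')))
    return "".join(out)

cipher = vigenere("DIGITALLIBRARY", "LEMON")
assert vigenere(cipher, "LEMON", decrypt=True) == "DIGITALLIBRARY"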
Recall: Spot Check
• Make up a key
• Encode a plain text message (not more than
20 characters, but at least 10)
• Pass the key and the encoded message to
another team.
• Decode the message you receive.
How secure?
• The Vigenere cipher looks really hard, but is not secure.
Since the keyword repeats, it is really just a bunch of
monoalphabetic codes. If you can figure out the length of
the keyword, you can do standard analysis.
• (It was considered unbreakable for nearly 300 years)
• Making it harder - instead of a regular arrangement of the
letter columns, scramble them in some arbitrary way.
– Makes decoding much more difficult, but also makes it difficult to
have the arrangement known to the people who are supposed to be
able to read the message.
Enigma
• Suppose we take a conversion for the first letter of the
message and a different mapping for the next letter and a
different mapping for the next letter …
• That is what we did with Vigenere
• Add additional encodings. Rotate from a fixed starting
point through 26 positions of the first set of columns,
then iterate a second set of columns. Now have 676
different mappings.
• To decode, must figure out the wiring inside each phase,
and the order in which they are arranged in the machine.
Enigma - 2
Encryption/Decryption Keys
• Problem is that you have to get the key to the receiver,
secretly and accurately.
• If you can get the key there, why not use the same method to send
the whole message? (Efficiency of scale)
• If the key is compromised without the communicators knowing it, the
transmissions are open.
• Exact working of the enigma machine:
– http://www.codesandciphers.org.uk/enigma/example1.htm
– How Polish mathematicians broke the enigma
– http://www.codesandciphers.org.uk/virtualbp/poles/poles.htm
Summary of encryption goals
• High level of data protection
• Simple to understand
• Complex enough to deter intruders
• Protection based on the key, not the algorithm
• Economical to implement
• Adaptable for various applications
• Available at reasonable cost
Data Encryption Standard
• Complex sequence of transformations
– hardware implementations speed performance
– modifications have made it very secure
• Known algorithm
– security based on difficulty in discovering the key
• http://www.itl.nist.gov/fipspubs/fip46-2.htm
The Data Encryption Standard Illustrated
64 bit blocks, 64 bit key
Federal Information Processing Standards 46-2 http://www.itl.nist.gov/fipspubs/fip46-2.htm
INTERNET-LINKED COMPUTERS CHALLENGE DATA ENCRYPTION STANDARD
LOVELAND, COLORADO (June 18, 1997). Tens of thousands of computers, all across the U.S.
and Canada, linked together via the Internet in an unprecedented cooperative supercomputing
effort to decrypt a message encoded with the government-endorsed Data Encryption Standard
(DES).
Responding to a challenge, including a prize of $10,000, offered by RSA Data Security, Inc, the
DESCHALL effort successfully decoded RSADSI's secret message.
According to Rocke Verser, a contract programmer and consultant who developed the specialized
software in his spare time, "Tens of thousands of computers worked cooperatively on the challenge
in what is believed to be one of the largest supercomputing efforts ever undertaken outside of
government."
Using a technique called "brute-force", computers participating in the challenge simply began
trying every possible decryption key. There are over 72 quadrillion keys (72,057,594,037,927,936).
At the time the winning key was reported to RSADSI, the DESCHALL effort had searched almost
25% of the total. At its peak over the recent weekend, the DESCHALL effort was testing 7 billion
keys per second.
Public Key encryption
• Eliminates the need to deliver a key
• Two keys: one for encoding, one for decoding
• Known algorithm
– security based on security of the decoding key –
note, no key delivery problem
• Essential element:
– knowing the encoding key will not reveal the
decoding key
Effective Public Key Encryption
• Encoding method E and decoding method D are inverse functions
on message M:
– D(E(M)) = M
• Computational cost of E, D reasonable
• D cannot be determined from E, the algorithm, or any amount of
plaintext attack with any computationally feasible technique
• E cannot be broken without D (only D will accomplish the decoding)
• Any method that meets these criteria is a valid Public Key
Encryption technique
It all comes down to this:
• key used for decoding is dependent upon
the key used for encoding, but the
relationship cannot be determined in any
feasible computation or observation of
transmitted data
Rivest, Shamir, Adleman (RSA)
• Choose 2 large prime numbers, p and q, each
more than 100 digits
• Compute n=p*q and z=(p-1)*(q-1)
• Choose d, relatively prime to z
• Find e, such that e*d ≡ 1 (mod z)
– or e*d mod z = 1, if you prefer.
• This produces e and d, the two keys that define the E and D
methods.
Public Key encoding
• Convert M into a bit string
• Break the bit string into blocks, P, of size k   (What was the problem here?)
– k is the largest integer such that 2^k < n
– P corresponds to a binary value: 0 ≤ P < n
• Encoding method
– E = Compute C = P^e (mod n)
• Decoding method
– D = Compute P = C^d (mod n)
• e and n are published (public key)
• d is closely guarded and never needs to be disclosed
This version of the algorithm comes from Tannenbaum, Computer Networks.
An example:
• p=7; q=11; n=77; z=60
• d=13; e=37; k=6
• Test message = CAT
• Using A=1, etc. and 5-bit representation:
– 00011 00001 10100
• Since k=6, regroup the bits (arrange right to left so that any padding needed will put 0's on the left and not change the value):
– 000000 110000 110100
• decimal equivalent: 0 48 52 (three leading zeros added to fill the block)
• Each of those raised to the power 37 (e) mod n: 0 27 24
• Each of those values raised to the power 13 (d) mod n (convert back to the original): 0 48 52
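The arithmetic above is small enough to verify directly. A sketch using Python's built-in modular exponentiation (pow with three arguments) and modular inverse (pow with exponent -1, available from Python 3.8):

p, q = 7, 11
n, z = p * q, (p - 1) * (q - 1)          # n = 77, z = 60
d = 13                                   # chosen relatively prime to z
e = pow(d, -1, z)                        # 37, since 13 * 37 = 481 = 8*60 + 1

blocks = [0, 48, 52]                     # "CAT" regrouped into 6-bit blocks
cipher = [pow(P, e, n) for P in blocks]  # [0, 27, 24]
plain  = [pow(C, d, n) for C in cipher]  # back to [0, 48, 52]

The same last line decodes the later "Your turn" exercise, whose blocks 000000 011011 011000 are exactly the ciphertext 0, 27, 24.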
A practical note
• There is a lot more to security than encryption.
• Encryption coding is done by a few experts
• Understanding how the common encryption
algorithms work is useful in choosing the right
approach for your situation.
• Our interest here is in providing assurance that
access to protected resources will be limited to those
with legitimate rights.
On a practical note: PGP
• You can create your own real public and private
keys using PGP (Pretty Good Privacy)
• See the following Web site for full information.
• (MIT site - obsolete)
• http://www.pgpi.org/products/pgp/versions/freeware/
• http://www.freedownloadscenter.com/Utilities/Required_Files/PGP.
html
Issues
• Intruder vulnerability
– If an intruder intercepts a request from A for B’s public key, the intruder can masquerade as B and receive messages from A intended for B. The intruder can send those same or different messages to B, pretending to be A.
– Prevention requires authentication of the public key to
be used.
• Computational expense
– One approach is to use Public Key Encryption to send
the Key for use in DES, then use the faster DES to
transmit messages
Digital Signatures
• Some messages do not need to be encrypted,
but they do need to be authenticated: reliably
associated with the real sender
– Protect an individual against unauthorized access
to resources or misrepresentation of the
individual’s intentions
– Protect the receiver against repudiation of a
commitment by the originator
Digital Signature basic technique
[Figure: Sender A signals receiver B the intention to send. B replies with E(Random Number), where E is A’s public key. A returns the message together with D(E(Random Number)) = Random Number, decoded as only A could do.]
Public key encryption with implied signature
• Add the requirement that E(D(M)) = M
• Sender A has encoding key EA, decoding key DA
• Intended receiver has encoding (public) key EB
• A produces EB(DA(M))
• Receiver calculates EA(DB(EB(DA(M))))
– Result is M, but also establishes that only A could have encoded M
Digital Signature Standard (DSS)
• Verifies that the message came from the
specified source and also that the message
has not been modified
• More complexity than simple encoding of a
random number, but less than encrypting the
entire message
• Message is not encoded. An authentication
code is appended to it.
Digital Signature – SHA
(Secure Hash Algorithm)
FIPS Pub 186 - Digital Signature Standard http://www.itl.nist.gov/fipspubs/fip186.htm
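The appended authentication code starts from a hash of the message. A sketch of just the hashing step with Python's hashlib (the original FIPS 186 paired DSA with SHA-1; signing the digest with a private key is omitted here, and the message text is made up):

import hashlib

message = b"The message itself is sent unencrypted."
digest = hashlib.sha1(message).hexdigest()
# The sender signs `digest` with a private key and appends the signature;
# any change to the message changes the digest and breaks verification.
print(digest)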
Encryption summary
• Problems
– intruders can obtain sensitive information
– intruder can interfere with correct information
exchange
• Solution
– disguise messages so an intruder will not be able
to obtain the contents or replace legitimate
messages with others
Important methods
• DES
– fast, reasonably good encryption
– key distribution problem
• Public Key Encryption
– more secure
• based on the difficulty of factoring very large numbers
– no key distribution problem
– computationally intense
Digital signatures
• Authenticate messages so the sender cannot
repudiate the message later
• Protect messages from changes during
transmission or at the receiver’s site
• Useful when the contents do not need
encryption, but the contents must be accurate
and correctly associated with the sender
Recall: Your turn
• You receive this: 000000011011011000
– There will not be any convenient spacing between
the blocks in the transmitted message. That is not
necessary.
• It has been encoded with your public key. You have the private key: d = 13, n = 77
• Show the decoding process
Understanding Quality in a DL
• Quality indicators: proposed descriptions of
quantities or observable variables that may be
related to quality
– “measures” = stronger term. Requires validation
– Gonçalves et al provide analysis of quality conditions and recommend
specific quantities to be used.
• Dimensions of quality
• Proposed indicators
• Application to DL concerns
Getting the data
• Where does the data come from?
– Logging
– Surveys
– Focus Groups
• Know what information is needed, then choose the
method most likely to provide the data.
– More about the sources of data after we see what we
need to know.
What are we looking for?
• What characteristics of a digital library raise questions about quality?
– Data objects
– Metadata
– Collection
– Catalog
– Repository
– Services
• What characteristics do we want each of those to have?
Dimensions of Quality
• Digital Object
– Accessibility
– Pertinence
– Preservability
– Relevance
– Similarity
– Significance
– Timeliness
• Metadata Specification
– Accuracy
– Completeness
– Conformance
• Collection
– Completeness
• Catalog
– Completeness
– Consistency
• Repository (may hold more than one collection)
– Completeness
– Consistency
• Services
– Composability
– Efficiency
– Effectiveness
– Extensibility
– Reusability
– Reliability
Recall: Spot check
• For your digital library project, how will you define quality for each of these factors?
– Data objects
– Metadata
– Collection
– Catalog
– Repository
– Services
• What is your intention, or your goals, for each of these?
I will ask each group to present two of these (briefly), but prepare all of them.
Information need - Digital Objects
• Accessibility
– What collection?
– # of structured streams
– Rights management metadata
– Communities to be served
• Relevance
– Feature frequency
– Inverse document frequency
– Document size
– Document structure
– Query size
– Collection size
• Significance
– Citation/link patterns
• Preservability
– Fidelity (lossiness)
– Migration cost
– Digital object complexity
– Stream formats
• Pertinence
– Context
– Information content
– Information need
• Similarity
– All the same features as in relevance
– Also: citation/link patterns
• Timeliness
– Age
– Time of latest citation
– Collection freshness
Information need - Metadata
Specification
• Accuracy
– Accurate attributes
– # attributes in the record
• Completeness
– Missing attributes
– Schema size
• Conformance
– Conformant attributes
– Schema size
Information - Collection and Catalog
• Completeness of the Collection
– Collection size
– Size of an “ideal” collection
• Completeness of the Catalog
– # of digital objects with no metadata
• Item level metadata
– Size of the collection
• Catalog Consistency
– # of metadata specifications per digital object
Information about the Repository
• Completeness
– # of collections
• Consistency
– # of collections
– Catalog/collection match
• How well do the catalogs match the collections?
• Are the catalogs for all the collections at the same
level of detail?
Service Information Need
• Composability (ability to be
combined to form new services)
– Extensibility
– Reusability
• Efficiency
– Response time
• Effectiveness
– Precision/recall (of search)
– Classification
• Extensibility
– # extended services
– # services in the DL
– # lines of code per service
manager
• Reusability
– # reused services
– # services in the DL
– # lines of code per service
manager
• Reliability
– # service failures
– # accesses
Making more concrete
• Each of the measures listed gives an idea of
the information need
• Exactly what do we measure?
• How do we combine numbers obtained to get
a usable result?
• Following pages describe specific measures
and formulas for combining those.
Solidifying Pertinence
• How do we measure something like
pertinence?
• Relation between the information content of a
digital object and the need of the user
• Depends on the user’s situation - background, current context, etc.
Preservability
• Property of a digital object that describes its
state relative to changes in hardware and
software, representation format standards
– Ex new recording technologies (replacement of
VHS video tapes by DVDs)
– New versions of software such as Word or Acrobat
– New image standards such as JPEG 2000
Digital preservation techniques
• Migration (most commonly used)
– Transform from one format to another
• Ex. Open the document in one format and save in another or do an automated
transformation
• Emulation
– Reproducing the effect of the environment originally used to display the material
• Keep an old version of the software, or have new software that can read the old format
• Wrapping
– Keep the original format, but add enough human-readable metadata so that it can
be decoded in the future
• Note that the material is not directly usable
• Refreshing
– Copy the stream of bits from one location to another
• Particularly suitable for guarding against the physical deterioration of the medium
Preservability issues
• Obsolescence
– How out of date is the digital object?
• Many versions of the software?
• Old storage media?
– Difficult to migrate (examples to consider: Miniclip, the Internet Archive)
• Appropriate tools? Expertise?
• Fidelity
– How different is the migrated version from the original?
– Distortion = loss of information
• Preservability of a digital object in a digital library is a function of
the fidelity of the migration and the obsolescence of the object
• Preservability(doi, dl) = (fidelity of migrating (doi, formatx, formaty),
obsolescence(doi, dl))
– Two values to reflect the two dimensions of the concept: fidelity and
obsolescence
Preservability factors
• Capital direct costs
– Software
• Developing software to create new versions of the object or
obtaining licenses for new versions of the original software
– Hardware
• For processing the migration and for storing the results
• Indirect operating costs
– Monitoring digital objects for migration needs
– Maintaining up-to-date intellectual property rights
– Storage
– Staff training
Significance
• Significance is an expression of the absolute
usefulness of a given digital object,
independent of particular user needs.
• Citation records of objects in digital libraries
offer one measure of significance. (This
disadvantages the most recently obtained
objects, since they have had less time to be
cited by others.)
Look at ACM DL and the citation counts, for example.
Life Cycle and Quality
• The quality indicators relate to the core components of a digital
library – creation, use, finding, distribution.
• Creation
– Authoring, modifying
– Describing, Organizing, Indexing
• Use
– Access, filtering
• Finding (seeking)
– Searching, Browsing, recommending
• Distribution
– Storing
– Archiving
– Networking
Quality and Lifecycle - 2
Quality and Life Cycle - 3
• Note that some elements repeat
– Timeliness is relevant to the content and to the
metadata that describes the content
– Accessibility affects both usefulness and
distribution.
Digital Library User Interface and
Usability
Goals:
• Discover elements of good interface design for
digital libraries of various sorts
• Consider examples from DL usability evaluation as
sources of insight.
• Look at the distinct requirements of interfaces to
libraries of video and audio files
Methods of evaluation
• Surveys
• Target user groups
– Focus groups from the intended audiences
– another recommendation: faux focus groups
• When it is not practical to do a real focus group for a while, the
developers do some role playing, pretend to be users
• What do you think of this approach?
• Ethnographic studies
– Audio/video taped sessions of users
– Analysis of feedback and comments
• Demographic analysis of beta tester registration data
• Log analysis
Hill 97
Recall: Your plans
• How will you evaluate the usability of your
digital library?
– What is ideal?
– What is practical?
– What do you plan to do?
Take two or three minutes to think about it and jot notes to yourself. If you are
part of a team, do this on your own. (You can compare your team’s responses
later.) Then we will hear from each of you.
Evaluation
• Evaluation for any purpose has two major
components
– Formative
• During development, spot check how things are progressing
• Identify problems that may prevent goals from being achieved
• Make adjustments to avoid the problems and get the project
back on track
– Summative
• After development, see how well it all came out
• Lessons learned may be applicable to future projects, but are
too late to affect the current one.
• Needed for reporting back to project sponsors on success of the
work.
Recall: Spot check
• Divide into pairs so that each member of the pair is
from a different project.
• One member of the group looks at the other person’s
project.
– Try to use it
– Give feedback on usability
• Switch to the other project and repeat
• About 5 - 7 minutes on each project
• What kind of evaluation was this?
Did this exercise lead to any
changes in your project
plans?
Usability evaluation
• Lab-based formative evaluation
– Real and representative users
– Benchmark tasks
– Qualitative and quantitative data
– Leads to redesign where needed
• After deployment
– Real users doing real tasks in daily work
– Summative with respect to the deployed system
– Useful for later versions
Sample evaluation
• Digital library evaluated by a usability expert,
results reported.
– See references: Hartson and Perez-Quiñones
– NCSTRL (“Networked Computer Science Technical
Reference Library”)
– Still present at ncstrl.org. We looked at this in some detail and gathered some guidelines.
Categories of Problems
• General to most
applications, GUIs
– Wording
– Consistency
– Graphic layout and
organization
– User’s model of the
system
• Digital Library functionality
– Browsing
– Filtering
– Searching
– Document submission functions
Hartson 04
Guidelines discovered
• Standardize terminology and check it carefully
• Clearly indicate where the user is in the overall system
• Use terms that are meaningful to users without explanation whenever
possible. Resist presenting data that is not useful for user purposes.
• Label for the user, not the developer
• Label results appropriately, even scrupulously, for their real meaning.
• Cosmetic considerations can have a positive effect on the user’s impression of the site.
• Organize task interfaces by categories to present a structured system
model and reduce cognitive workload.
Guidelines, continued
• Consider the implications of placement and association of
graphical elements.
• Any application should have a home page that explains what
the site is about and gives the user a sense of the overall site
capability and use.
• Usability suggestion: combine search, browse, filter into one
selection and navigation facility.
• Allow user activity that will serve user needs. Try to find out
what users want before making decisions about services
offered
• Link directly to the service offered without any intermediate
pages unless needed in support of the service.
Recall: Spot check
• Given all those points (in the bold and
different color type), pick one that strikes you
as especially relevant for your project and say
how you will address it.
– Do consider all of them when working on your
project – just pick one to talk about now.
Source: Lee 02
Video Digital Libraries
• Video digital libraries offer more
challenges for interface design
– Information attributes are more
complex
• Visual, audio, other media
– Indicators and controlling widgets
• Start, stop, reverse, jump to
beginning/end, seek a particular frame or a
frame with a specified characteristic
Note: YouTube started in 2005. This material was published in 2002. It is of interest to see how YouTube compares with the desired characteristics of a video library.
Source: Lee 02
[Figure: summarizing the stages of information seeking and the interface elements that support them, as described in four researchers’ work.]
Recall: A scenario
• Directory with 100 (or 1000 or…) video files.
• No information except the file name.
– Maybe reasonable name, but not very descriptive
• You want to find a particular clip from a party or a
ceremony or some other event.
• What are your options?
• What would you like to have available?
Spend a bit of time now talking about this.
Source: Lee 02
Video Abstraction
• Levels to present (from Shneiderman 98):
– Overview first
– Zoom and Filter
– Details on Demand
• Example levels (from Christel 97):
– Title: text format, very high level overview
– Poster frame: single frame taken from the video
– Filmstrip: a set of frames taken from the video
– Skim: multiple significant bits of video sequences
• Time reference
– Significant in video
– Options include simple timeline, text specification of time of the current frame, depth of browsing unit
Source: Lee 02
Keyframe browsing
• Extract a set of frames from the video
– Display each as a still image
– Link each to play the video from that point
• Selection is not random
– Video analysis allows recognition
• Sudden change of camera shot
• Scenes with motion or largely stationary
– Video indexing based on frame-by-frame image
comparison
• Similar to thumbnail browsing of image collections
[Figure: keyframe extraction for display on a browsing interface. Source: Lee 02]
Source: Lee 02
Keyframe extraction
• Manual
– Owner or editor explicitly selects the frames to be used as
index elements
• Automatic
– Subsampling - select from regular intervals
• Easy, but may not be the best representation
– Automatic segmentation - break the video into meaningful
chunks and sample each
• Shot boundary detection - note switch from one camera to
another, or distinct events from one camera
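A minimal sketch of the subsampling approach, assuming the OpenCV (cv2) library is available; it keeps one frame out of every fixed-size interval as a candidate keyframe. Shot-boundary detection would instead compare successive frames and keep those where the difference spikes.

import cv2

def subsample_keyframes(path, every=250):
    # Keep one frame every `every` frames as candidate keyframes.
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

keyframes = subsample_keyframes("party.avi")   # hypothetical file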
Metadata Harvesting
Interoperable digital collections
Distributed libraries
• The reality in most digital libraries is that no one
location has all the materials that may be of interest.
• It is often more efficient to allow a number of sites
each to retain some of the materials.
• How can we assure clients that they will see all
relevant resources, regardless of which library they
search?
What I was doing this week
• Discussion and summary of the NSDL program
experience.
– Questions about the appropriateness of the library
metaphor
– These apply to any digital library. Is that the right term?
• Interoperability relates to inter-library loan,
common catalog availability, branch libraries,
community specific libraries, etc.
Two basic approaches
• One service provider with access to resources
stored in multiple locations
– Information about all the resources located at the
service provider.
– Services (DL scenarios) use the information to provide
connections to resources at multiple locations
• Distributed services
– Information kept with the resources
– Services, local to each collection, interact with other
collection sites
Distributed Resources - Multiple Services
[Figure: Approach 1 - one service provider (search, browse, compare, etc.) gathers information about data from many data providers and uses it to provide services.]
Distributed data and services
[Figure: Approach 2 - each system is both a data repository and a service provider (search, browse, compare). Services query other data providers as needed.]
Hybrid systems
[Figure: a mix of pure data providers, pure service providers, and combined data/service providers. Each server is likely to have its own clients; the difference is whether the information exchange is periodic or ad hoc.]
Open Archives Initiative (OAI)
• Web-based
– Uses HTTP to communicate between sites
• Centralized server
– Services provided from a site that has already
gathered the information it needs for those
services from a distributed collection of sites.
http://www.openarchives.org/pmh/
OAI PMH
• Interoperability through Metadata Exchange
• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.
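The six verbs are Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord. Since requests travel over plain HTTP, one harvesting step is just a GET with the verb as a query parameter. A sketch in Python (the repository base URL is hypothetical):

from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://example.org/oai"   # hypothetical data provider endpoint

def list_records(metadata_prefix="oai_dc"):
    # One ListRecords request; the XML response carries the harvested records.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urlopen(BASE + "?" + query) as resp:
        return resp.read()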
OAI - ORE
http://www.openarchives.org/ore/
• Aggregations of Web Resources
• Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards
for the description and exchange of aggregations of Web resources. These
aggregations, sometimes called compound digital objects, may combine distributed
resources with multiple media types including text, images, data, and video. The
goal of these standards is to expose the rich content in these aggregations to
applications that support authoring, deposit, exchange, visualization, reuse, and
preservation. Although a motivating use case for the work is the changing nature of
scholarship and scholarly communication, and the need for cyberinfrastructure to
support that scholarship, the intent of the effort is to develop standards that
generalize across all web-based information including the increasing popular social
networks of “web 2.0”.
http://www.openarchives.org
ore/1.0/primer.html#Example
OAI - ORE
• ORE allows aggregation of related web
pages to form a logical unit
– The representation allows access to all of the
components of a resource at once.
Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH
[Figure: a harvester (service provider) sends an HTTP request carrying an OAI verb to a repository (metadata provider), which returns an HTTP response carrying XML. The repository side may be implemented as CGI, ASP, PHP, or other. OAI-PMH defines the interface between the harvester and any number of repositories; any system may serve as a harvester, a repository, or both.]
OAI - PMH components
• Service Providers and Data Providers
• Requests and Responses
http://www.oaforum.org/tutorial/english/page3.htm#section3
Records
• Metadata of a resource.
• Three parts
– Header (required)
• Identifier (required: 1 only)
• Datestamp (required: 1 only)
• setSpec elements (optional: 0, 1, or more)
• Status attribute for deleted item
– Metadata (required)
• XML encoded metadata with root tag, namespace
• Repositories must support Dublin Core, other formats optional
– “About” statement (optional)
• Rights statements
• Provenance statements
Interoperability
• The goal: communication, without human
intervention, between information sources
– Books that “talk to each other”
• Live links for references
• Knowledge of how to find relevant resources when
needed
• Ability to query other information locations
Protocols
• Precise rules for interactions between independent
processes
– Format of the messages
• Both structure and content
– Specified behavior in response to specific messages
• Many ways to accomplish the same result, but both
sides must have the same understanding of the rules
of engagement.
Information Retrieval
• This was the most recent part of the class, so
will not occupy a prime position in this review
• Digital libraries may be considered a sub
discipline of IR
Information Retrieval
• The basic problem
– Given a collection of materials, devise methods for
efficiently finding what is wanted.
• Issues:
– Precision
• Getting only materials that match the search need
– Recall
• Getting all the materials that match the search need
– These are usually contrary requirements
Representing the collection
• Efficient access to content depends on the
information available about the materials
• Indexing
– Producing pointers to materials
• pointers must be accurate and complete
• pointers must be searchable, efficiently
One more time: the scale of the problem
[Figure: a byte scale running kilo, mega, giga, tera, peta, exa, zetta, yotta, with markers for a book, a photo, a movie, all books (words), all books multimedia, and "everything recorded!" near the top.]
• Soon most everything will be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, anomaly detection are key technologies
• These require algorithms, data and knowledge representation, and knowledge of the domain
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
(Prefix exponents: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli)
Slide source: Jim Gray – Microsoft Research (modified)
Inverted index construction
[Figure: the indexing pipeline. Documents to be indexed ("Friends, Romans, countrymen.") pass through a Tokenizer to produce a token stream (Friends, Romans, Countrymen), then through linguistic modules (stop words, stemming, capitalization, cases, etc.) to produce modified tokens (friend, roman, countryman), and finally through the Indexer to produce the inverted index: friend -> 2, 4; roman -> 1, 2; countryman -> 13, 16.]
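A toy version of that pipeline in Python, with lowercasing and punctuation stripping standing in for the full linguistic modules (a real system would also stem, e.g. friends to friend):

from collections import defaultdict

def build_index(docs):
    # docs maps doc_id -> text; returns term -> sorted list of doc_ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.strip(".,;:!?").lower()   # crude normalization
            if term:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "Friends, Romans, countrymen."})
# index["friends"] == [1]; a stemmer would reduce this to "friend".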
Scaling
• These basic techniques are pretty simple
• There are challenges
– Scaling
• as everything becomes digitized, how well do the
processes scale?
– Intelligent information extraction
• I want information, not just a link to a place that might
have that information.
Ranked retrieval models
• Rather than a set of documents satisfying a query expression,
in ranked retrieval models, the system returns an ordering
over the (top) documents in the collection with respect to a
query
• Free text queries: Rather than a query language of operators
and expressions, the user’s query is just one or more words in
a human language
• In principle, these are different options, but in practice,
ranked retrieval models have normally been associated with
free text queries and vice versa
Scoring as the basis of ranked retrieval
• We wish to return, in order, the documents most
likely to be useful to the searcher
• How can we rank-order the documents in the
collection with respect to a query?
• Assign a score – say in [0, 1] – to each document
• This score measures how well document and
query “match”.
Term frequency - tf
• The term frequency tft,d of term t in document
d is defined as the number of times that t
occurs in d.
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is
more relevant than a document with 1 occurrence
of the term.
– But not 10 times more relevant.
• Relevance does not increase proportionally
with term frequency.
NB: frequency = count in IR
Log-frequency weighting
• The log frequency weight of term t in d is
  w_(t,d) = 1 + log10(tf_(t,d)) if tf_(t,d) > 0, and 0 otherwise
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:
  score = Σ_(t in q and d) (1 + log10(tf_(t,d)))
• The score is 0 if none of the query terms is present in the document.
idf weight
• df_t is the document frequency of t: the number of documents that contain t
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N (the number of documents)
• We define the idf (inverse document frequency) of t by
  idf_t = log10(N / df_t)
– We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
It will turn out that the base of the log is immaterial.
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
  w_(t,d) = (1 + log10(tf_(t,d))) × log10(N / df_t)
• Best known weighting scheme in information retrieval
– Note: the “-” in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
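The two factors combine in a few lines. A sketch computing the weight from raw counts (tf is the count of the term in the document, df the number of documents containing it, N the collection size):

import math

def tf_idf(tf, df, N):
    # w_(t,d) = (1 + log10 tf) * log10(N / df); zero when the term is absent.
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in a document and appearing in
# 1,000 of 1,000,000 documents: (1 + 1) * 3 = 6.0
print(tf_idf(10, 1_000, 1_000_000))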
Final ranking of documents for a query
Score(q, d) = Σ_(t in q and d) tf-idf_(t,d)
Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
• These are very sparse vectors - most entries are zero.
From angles to cosines
• The following two notions are equivalent.
– Rank documents in decreasing order of the angle
between query and document
– Rank documents in increasing order of
cosine(query,document)
• Cosine is a monotonically decreasing function for the interval [0°, 180°]
Therefore, a small angle results in a large cosine
and a large angle results in a small cosine – just
the behavior we need.
Length normalization
• A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:
  ||x||_2 = sqrt(Σ_i x_i²)
• Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere)
• Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.
– Long and short documents now have comparable weights
Cosine for length-normalized vectors
• For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
  cos(q, d) = q · d = Σ_i q_i × d_i
  for q, d length-normalized.
Cosine similarity illustrated
This is for vectors
representing only
two words. Could
you draw the vectors
for your news
examples?
tf-idf weighting has many variants
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Summary – vector space ranking
• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf
vector
• Compute the cosine similarity score for the query
vector and each document vector
• Rank documents with respect to the query by
score
• Return the top K (e.g., K = 10) to the user
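A compact sketch of those four steps, representing each vector as a sparse dict from term to tf-idf weight (computing the weights themselves is left to the earlier formulas):

import math

def normalize(vec):
    # Divide by the L2 norm so that cosine reduces to a dot product.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(q, d):
    # Dot product over the terms the two sparse vectors share.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def rank(query_vec, doc_vecs, k=10):
    q = normalize(query_vec)
    scores = [(cosine(q, normalize(d)), doc_id)
              for doc_id, d in doc_vecs.items()]
    return sorted(scores, reverse=True)[:k]   # top K documents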
Reality
• The search engines use a variation of the tf-idf
ranking. If they used a published algorithm,
anyone could fool the search engine ranking
process and results would not be good.
• At Google, every bit of code written must be
read by someone else. Everyone sees other
code. Only the ranking code is secret.
Web Crawling
First, what is crawling?
A web crawler (aka a spider or a robot) is a program that:
– Starts with one or more URLs – the seed
• Other URLs will be found in the pages pointed to by the seed URLs. They will
be the starting point for further crawling
– Uses the standard protocols for requesting a resource from a server
• Requirements for respecting server policies
• Politeness
– Parses the resource obtained
• Obtains additional URLs from the fetched page
– Implements policies about duplicate content
– Recognizes and eliminates duplicate or unwanted URLs
– Adds found URLs to the queue and continues from the request to
server step
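A toy version of that loop using only the standard library: seed queue, fetch, parse for links, eliminate duplicate URLs, continue. Politeness here is reduced to a fixed delay, the regex link parser stands in for a real HTML parser, and robots.txt handling appears in a later sketch.

import re
import time
from urllib.request import urlopen

def crawl(seeds, limit=50, delay=1.0):
    queue, seen, pages = list(seeds), set(seeds), {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                        # fetch failed; move on
        pages[url] = html
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:            # duplicate URL elimination
                seen.add(link)
                queue.append(link)
        time.sleep(delay)                   # crude politeness
    return pages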
Crawler features
• A crawler must be
– Robust: Survive spider traps. Websites that fool a spider
into fetching large or limitless numbers of pages within
the domain.
• Some deliberate; some errors in site design
– Polite: Crawlers can interfere with the normal operation
of a web site. Servers have policies, both implicit and
explicit, about the allowed frequency of visits by
crawlers. Responsible crawlers obey these. Others
become recognized and rejected outright.
Ref: Manning Introduction to Information Retrieval
Crawler features
• A crawler should be
– Distributed: able to execute on multiple systems
– Scalable: The architecture should allow additional machines to be added as
needed
– Efficient: Performance is a significant issue if crawling a large web
– Useful: Quality standards should determine which pages to fetch
– Fresh: Keep the results up-to-date by crawling pages repeatedly in some
organized schedule
– Extensible: Modular, well-crafted architecture allows the crawler to expand
to handle new formats, protocols, etc.
Ref: Manning Introduction to Information Retrieval
Robots.txt
• Protocol nearly as old as the web
  See www.robotstxt.org/robotstxt.html
• File: URL/robots.txt
• Contains the access restrictions
• Example:
  User-agent: *              (all robots/spiders/crawlers)
  Disallow: /yoursite/temp/
  User-agent: searchengine   (robot named searchengine only)
  Disallow:                  (nothing disallowed)
Source: www.robotstxt.org/wc/norobots.html
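Python ships a parser for exactly this file. A sketch of the check a polite crawler makes before each fetch (the site is hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.org/robots.txt")
rp.read()                                  # fetch and parse the file
if rp.can_fetch("searchengine", "http://example.org/yoursite/temp/page.html"):
    pass                                   # allowed: go ahead and fetch

Against the example robots.txt above, the "searchengine" agent has nothing disallowed, so this returns True; the wildcard agent would be blocked from /yoursite/temp/.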
Architecture of a Search Engine
Ref: Manning Introduction to
Information Retrieval
Basic Crawl Architecture
[Figure: the URL frontier feeds URLs to a fetch module (consulting DNS and the web); fetched pages are parsed; parsed content passes a "content seen?" test using document fingerprints (Doc FP's), then a URL filter applying robots.txt exclusions, then duplicate URL elimination against the URL set; surviving URLs are added back to the URL frontier.]
Ref: Manning Introduction to Information Retrieval
Summary
• We looked at a lot of topics, all of which are
related to digital libraries.
• The library metaphor for this class of web-based information system works if we understand all that a library is.
• Translating the library model into a web-based
information system touches on a lot of
information handling issues.