Preservation Risk Management for Web Resources: The Prism Project

Presented by Aravind Elango
What is the Prism project about?
A project exploring technologies that would form the basis for a suite of protocols to support
risk-based management and evaluation of web resources.
Current trends of content acquisition in digital libraries
• Libraries spend more than 40% of their budget on licenses for science journals, which might not
be preserved or maintained in the long term. They also maintain paper copies for preservation.
• Many university libraries use content from open libraries, which have no contract for
access or preservation.
The fragility of the content repository (WWW)
• Only 50% of the IP addresses valid in one year are still valid in the next.
• URLs are valid for 44 days on average.
• Viruses and other malicious code wipe out a significant number of websites.
Current effort of libraries to protect content on the WWW
• Partnerships with publishers
1. Many organizations, such as Elsevier Science and OCLC, have already formulated policies for
preserving digital content.
2. The Mellon Foundation is funding research on how to better preserve electronic journals.
• Providing guidelines for maintaining websites
1. W3C provides very good guidance for resource management (through standardized formats
and backward-compatible software), but not for content stability or database management.
• Government preservation efforts
1. In 2001, the US national commission on libraries recognized that web resources are part of
national resources. Most of these declarations seem to be outlines rather than concrete action
plans.
2. Content from US federal agencies is now turned over to NARA for long-term record keeping.
3. Australian preservation efforts are more concrete and have already received positive
responses from resource owners.
Crawling the WWW for preservation efforts
• Voluntary groups have been crawling the web and preserving its content at various time
intervals.
• Although successful, crawlers face the following challenges:
1. The hidden web
2. Legal issues
3. Selection: is the important content actually being stored?
4. Volume
5. Dynamic content
6. Changing software standards
Where does Prism play a part?
When a library uses content on the WWW that it does not manage and wants to be better
informed about the risk factors associated with that content, the guidelines provided by Prism would be
valuable. The design calls for software that keeps checking the status of a resource against
established guidelines. Which resources to check, and which status conditions invoke action, are
left to the administrators of the software.
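One way to picture the administrator-controlled part of this design is a small rule object per monitored resource. The sketch below is purely illustrative and is not taken from Prism: the names WatchRule, alert_statuses, max_failures, and needs_action are all hypothetical, and it assumes the status being tracked is an HTTP status code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchRule:
    """Hypothetical administrator-defined rule for one monitored resource."""
    url: str                    # resource the administrator chose to monitor
    alert_statuses: frozenset   # HTTP status codes that count as a bad check
    max_failures: int = 1       # consecutive bad checks needed before acting


def needs_action(rule, recent_statuses):
    """Return True if the most recent checks end in a run of alerting
    statuses at least as long as the rule's threshold."""
    run = 0
    for status in reversed(recent_statuses):
        if status not in rule.alert_statuses:
            break
        run += 1
    return run >= rule.max_failures
```

For example, a rule with alert_statuses of 404 and 410 and max_failures of 2 would ignore a single failed check but flag a resource whose last two checks both failed.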
Features to analyze the stability and maintenance of a web resource
The software could crawl web pages, and the downloaded pages could then be analyzed manually or
automatically.
Manual checks could analyze the following:
• Tidiness of HTML formatting
• Conformance to standards such as XHTML
• Document structure. Good structure indicates better management.
• Presence of metadata in a web page, which indicates good management
Automatic checks could analyze the following:
• HTTP response codes
• Response times. Good response times indicate stability and responsiveness.
• Type of domain name
• Link volatility
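The automatic checks listed above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not Prism's implementation: the function names, the risk labels, the use of the last host-name label as the "type of domain name", and the definition of link volatility as the share of links that differ between two crawls are all choices made here for the example.

```python
import time
import urllib.error
import urllib.request
from urllib.parse import urlparse


def probe(url, timeout=10.0):
    """Fetch url once; return (status_code_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        status = None       # no HTTP answer at all (DNS failure, timeout, ...)
    return status, time.monotonic() - start


def status_risk(status):
    """Map an HTTP status code to a coarse, illustrative risk label."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "moved"
    if status in (404, 410):
        return "gone"
    return "error"


def domain_type(url):
    """Return the last label of the host name, e.g. 'org' or 'edu'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def link_volatility(old_links, new_links):
    """Share of links that differ between two crawls (0.0 = identical)."""
    old_links, new_links = set(old_links), set(new_links)
    union = old_links | new_links
    if not union:
        return 0.0
    return 1.0 - len(old_links & new_links) / len(union)
```

Calling probe on a resource at regular intervals and feeding the result to status_risk gives the response-code and response-time signals; link_volatility compares the outgoing links found in two successive crawls of the same page.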
What is a website?
In simple terms, it could be a collection of web pages that share the same base URL or are
administered by the same management.
Analyzing the link structure of a website reveals a lot about the stability of its web pages, including
pages that change or become isolated from the rest of the collection. It is useful to create a directed
graph representation of the website and apply graph-theory-based algorithms to analyze risk.
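As a small illustration of the directed-graph idea, the sketch below represents a site as a mapping from each page to the pages it links to, then uses a breadth-first search from the home page to find pages that are no longer reachable by following links. The function name and the choice of reachability as the risk signal are assumptions made for this example; the source does not say which graph algorithms Prism uses.

```python
from collections import deque


def unreachable_pages(links, root):
    """Return the pages that cannot be reached from root by following links.

    links maps each page to an iterable of the pages it links to.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Every page that appears as a link source or as a link target.
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return all_pages - seen
```

A page that links into the site but is not linked from anywhere reachable from the home page, such as an orphaned "/old" page, would be reported as isolated.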
It is also useful to analyze the ecology of a website:
• The hardware and software environment it is in
• Administrative services, such as analyzing the reputation of the ISP handling the site
• Network configuration
• Backup policies
• Vulnerability to physical accidents
Plans for Prism
• Monitor changes to the protected website
• Recommend changes to web pages to ensure longevity
• Ensure that the compliance standards for the website are followed
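The first of these plans, monitoring changes, could be approached by storing a fingerprint of each page and comparing it on the next crawl. The sketch below assumes SHA-256 content hashes as the fingerprint; the function names and the four status labels are hypothetical and are not taken from the Prism project.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous, current):
    """Compare stored fingerprints with freshly crawled content.

    previous maps URL -> fingerprint from the last crawl.
    current maps URL -> raw bytes from the new crawl.
    """
    report = {}
    for url in sorted(previous.keys() | current.keys()):
        if url not in current:
            report[url] = "removed"
        elif url not in previous:
            report[url] = "added"
        elif fingerprint(current[url]) != previous[url]:
            report[url] = "changed"
        else:
            report[url] = "unchanged"
    return report
```

A byte-level hash flags any difference at all, including trivial ones such as an embedded timestamp, so a real monitor would likely need a more tolerant comparison.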
References:
1) Anne R. Kenney et al., "Preservation Risk Management for Web Resources: Virtual Remote Control in
Cornell's Project Prism", D-Lib Magazine, 8(1), 2002.
http://www.dlib.org/dlib/january02/kenney/01kenney.html