Archiving Web Content
CE Course #285
Sheraton NY Hotel & Towers
Sunday, June 8, 2003
Introductions
Meet today’s panelists
Today’s Panelists




Barry Abisch – The Journal News
Olivia Kobelt – Christian Science Monitor
Mark Stencel – Washingtonpost.com
Janine Yagielski – CNN.com
Agenda









Introductions & session overview
Technology
Workflow and processes
Brainstorming
Break
Brainstorming recap
Role of the librarian
Building the business case
Closing comments & session evaluation
Technology
Panelist: Janine Yagielski
Technology Overview




What can be archived?
Preparing content to be archived
Storing and serving archived content
Searching archived content
What can be archived?





Overview of file formats (handout)
Dynamic and static content
Archiving presentation as well as content
Archiving secondary information about online
content (traffic information)
Challenges of changing technologies
Overview of File Formats
(handout)




Text formats
Image/graphic file formats
Video formats
Other definitions
Static and Dynamic Content
Static Content: Content that once posted does not change.
Example: Simple story or information page
Dynamic Content: Constantly changing content
Example 1: Weather data, Stock Prices
Example 2: Election Results, Sports Scores (fixed end point)
Static and Dynamic Content
Hybrid: Changes occasionally but does not have a predictable
updating schedule or end point
Example: Top story with multiple and significant updates
Example: Home page or section page
Archiving Presentation and Content
CNN.com has built an internal
system to archive the presentation
of selected pages





Home Page, US, World, Politics,
International Edition
One week of pages
Every 30 minutes
Perl Script
Size of archive: 55.4 MB
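The slide names a Perl script but does not show it. As a minimal sketch of the same idea in Python, the helpers below name and store timed page snapshots and decide when one has aged out of the one-week window. The function names, the directory layout, and the file naming are assumptions made for illustration, not CNN.com's actual implementation.

```python
import time
from pathlib import Path

SNAPSHOT_INTERVAL = 30 * 60       # seconds between snapshots (30 minutes, per the slide)
RETENTION = 7 * 24 * 60 * 60      # seconds to keep a snapshot (one week, per the slide)


def snapshot_path(root, page_name, timestamp):
    """Build a dated path such as root/home/20030608-1430.html (layout is hypothetical)."""
    stamp = time.strftime("%Y%m%d-%H%M", time.gmtime(timestamp))
    return Path(root) / page_name / f"{stamp}.html"


def save_snapshot(root, page_name, html, timestamp):
    """Write one captured page to disk and return where it was stored."""
    path = snapshot_path(root, page_name, timestamp)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def expired(snapshot_time, now, retention=RETENTION):
    """True when a snapshot is older than the retention window."""
    return now - snapshot_time > retention


def snapshots_per_week(page_count, interval=SNAPSHOT_INTERVAL, retention=RETENTION):
    """Number of files the archive holds once the retention window is full."""
    return page_count * (retention // interval)
```

With the five pages listed on the slide, snapshots_per_week(5) gives 1,680 files. If the 55.4 MB figure covers exactly those files, that works out to roughly 33 KB per snapshot.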
Archiving Secondary Information about Online
Content (traffic)
CNN.com has an extensive Webstats reporting system that parses and
archives the information from Web server logs.
Simple statistics: Page Views, hits (back to 1996)
Advanced statistics: Unique users, time spent, IP address, OS, browser
Real Time Monitor: tracks click-through rates of links
  Home and US pages
  One week of info on links
  Tracks average and peak for links
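To illustrate the difference between hits and page views, here is a minimal sketch that parses Web server log lines. It assumes the logs are in Common Log Format and that a page view is a successful request for a path ending in "/", ".html", ".htm" or ".shtml". The slides do not say what log format or counting rules the Webstats system uses, so both are assumptions.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTO" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

PAGE_SUFFIXES = ("/", ".html", ".htm", ".shtml")  # assumption: what counts as a page


def summarize(lines):
    """Count hits (all requests) and page views (successful page requests)."""
    hits = 0
    page_views = Counter()
    hosts = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        hits += 1
        hosts.add(match["host"])
        path = match["path"].split("?", 1)[0]  # ignore any query string
        if match["status"] == "200" and path.endswith(PAGE_SUFFIXES):
            page_views[path] += 1
    return {
        "hits": hits,
        "page_views": sum(page_views.values()),
        "unique_hosts": len(hosts),
        "top_pages": page_views.most_common(3),
    }
```

Counting distinct host addresses is only a crude stand-in for unique users, which is presumably why the slide lists the two separately.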
Challenges of Changing Technology
Interdependencies of the Web make it difficult to maintain old
content when optimizing for new content.
Examples: .shtml pages, Vivo video, some Shockwave, other antiquated
multimedia technology based on plug-ins
Preparing Content to be Archived
Directory Structure/Database
Key to consistency and automation in subject-specific archives.
cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/
Slug conventions
Provide an additional method of automating archiving
Examples: sprj; sprj.nitop; .ap
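A consistent directory structure lets a script recover archive fields from the URL alone. The sketch below splits the example path into its parts. Reading the six segments as year, section, subsection, month, day and slug is an assumption based on the example URL and on the SECTION and SUBSECTION metadata fields shown elsewhere in this deck; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class StoryPath:
    year: int
    section: str
    subsection: str
    month: int
    day: int
    slug: str

    @property
    def slug_parts(self):
        """Dot-separated slug components, e.g. ['sprj', 'nitop', ...]."""
        return self.slug.split(".")


def parse_story_url(url):
    """Split a story URL into fields, assuming a six-segment path layout."""
    if "://" not in url:
        url = "http://" + url  # accept URLs written without a scheme
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 6:
        raise ValueError(f"unexpected path layout: {url}")
    year, section, subsection, month, day, slug = parts
    return StoryPath(int(year), section, subsection, int(month), int(day), slug)


def has_slug_prefix(story, prefix):
    """True when the slug starts with the given dotted prefix, such as 'sprj.nitop'."""
    wanted = prefix.split(".")
    return story.slug_parts[:len(wanted)] == wanted
```

An archiving job could then select every story whose slug begins with sprj.nitop without opening the files. A suffix convention such as .ap could be checked the same way from the end of slug_parts.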
Preparing Content to be Archived
Content Management System
Imposes and uses directory structure to prepare content for publication,
syndication and in some cases archiving and searching
Metadata added to stories at publish time
<meta name="DESCRIPTION" content="A U.S. soldier was killed and five
were wounded early Thursday in the Iraqi city of Fallujah, the U.S.
Central Command announced -- the latest casualties in the city, which
has become a center of resistance.">
<meta name="AUTHOR" content="">
<meta name="SECTION" content="WORLD">
<meta name="SUBSECTION" content="meast">
<meta name="DATE" content="2003-06-05 05:22:20">
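Metadata like the tags above can be read back out of an archived page with a few lines of code. This is a minimal sketch using Python's standard html.parser module; the class and function names are hypothetical, and it is not a description of how CNN.com's own tools read the tags.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        if name:
            # Normalize the key so DESCRIPTION and description are treated alike.
            self.meta[name.upper()] = fields.get("content") or ""


def extract_meta(html):
    """Return a dict of metadata fields found in an HTML page."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.meta
```

Run over a directory of archived stories, extract_meta would yield the section, subsection and date of each story without any database lookup.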
Preparing Content to be Archived
XML (Extensible Markup Language)
CNN.com produces an XML file with every story for site search. We also
produce XML feeds of story headlines and other data sent to
syndication partners.
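As a rough illustration of a per-story XML file, the sketch below serializes a story's fields with Python's standard xml.etree.ElementTree module. The element names and the function name are hypothetical; the slides do not show the schema CNN.com actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical field list; the real schema is not given in the slides.
STORY_FIELDS = ("url", "headline", "section", "subsection", "date", "description")


def story_to_xml(story):
    """Serialize a dict of story fields to an XML string, one element per field."""
    root = ET.Element("story")
    for field in STORY_FIELDS:
        child = ET.SubElement(root, field)
        child.text = story.get(field, "")
    return ET.tostring(root, encoding="unicode")
```

Because the library escapes characters such as & and <, headlines can be passed in as plain text.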
Metadata and XML for Multimedia
CNN.com is looking into ways to insert metadata and produce XML feeds
of non-traditional stories. Currently, archiving the location and subject
of interactive (pop-up) content is only an internal, manual process.
Storing and Serving Archived Content
Simple storage of content




Content servers
Burn to CD
Web servers (internal and external even if not served)
Tape backup
Serving to internal users




Image query
Directory browsing on the internal Web servers
Content purged from the external site remains available (AP, partner stories)
Limited space on internal Web server (36 GB)
Serving to All External Users
All unique URLs published on
CNN.com from the launch of the site
are still available, unless there was
an editorial decision to remove or
redirect a URL.
CNN video is hosted by AOL. Because of
changes in hosting and in the capacity of video
servers, not all previous video streams are
available.
Serving to All External Users
Web servers/NFS Server
Hardware: Sun and Intel (running Linux)
Cost: $10,000-$15,000 (Sun), $5,000 (Intel)
Capacity: Storage capacity expanded by adding additional hard drives.
Serving capacity varies by content. HTML -- 25K hits/minute; images,
style sheets -- 60-70K hits/minute
Video Servers
Hardware: Reconfigured Web server dedicated to video
Cost: $1,500-$3,000
Capacity: Depends on length and size of video and disk space
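The serving capacities above are quoted in hits per minute. For comparison with benchmarks stated per second, this short sketch converts the quoted figures; only the unit conversion is added, and the labels are ours.

```python
def per_second(hits_per_minute):
    """Convert a hits-per-minute figure to hits per second."""
    return hits_per_minute / 60


# Figures quoted on the slide, in hits per minute.
quoted = {
    "HTML": 25_000,
    "images/style sheets (low)": 60_000,
    "images/style sheets (high)": 70_000,
}

for label, rate in quoted.items():
    print(f"{label}: about {per_second(rate):.0f} hits/second")
```

That is roughly 417 hits per second for HTML and between 1,000 and 1,167 hits per second for images and style sheets.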
Serving to Select Users
Registration




E-mail newsletters
New e-mail alerts
Backend Oracle database
JSPs served dynamically
Subscription


Video
Real Networks handles CNN.com’s subscriber authentication
Searching Archived Content
Searching for internal users
Limited functionality for internal materials. Graphics image search. New
publishing tools will incorporate a search of externally published content.
Searching for external users
Site Search: Run by AOL. The CMS produces and publishes (restricted by IP) XML files
for every story. At set intervals, AOL picks up the XML files and uses them to
produce CNN.com’s internal search results.
Web Search: Powered by Google. Sponsored links from Overture. Both sets of
results are returned to CNN.com in XML feeds and published on a CNN.com
template.
Video/multimedia search: Exploring
Workflow
Panelist: Olivia Kobelt
Workflow Overview








Types of web content – what do we archive?
Archiving old content
Internal vs. external archive
Making corrections/fixes
Searchability
Current workflow
Systems we use
Future Vision
Brainstorming!
Break!
Be back in 15 minutes!
Brainstorming Recap!
Legal
Compliance vs. business use or need
Copyright – can you archive someone else’s content, partner content?
Talking to IT about what the requirements are
How do you approach gathering user requirements?
Who are users?
What are retention criteria? (date, size of files, originals/drafts/versioning,
exclude search, business value)
Hierarchy starting at bottom with knowledge, corporate, business use/reuse,
compliance, vital records
How to capture and keep the hybrid web pages?
What software applications are available?
Microfilm archiving?
What tools are available to automate the archival process?
Where do we begin? Seeking advice in relation to storage, retrieval, technology,
etc.
What type of information/literature is available on the topic of archiving web data?
Selling the idea to management
Archiving “how it looked”
How did we do it? Examples of how a project was done.
Measure what people are trying to find in older files
Managing the customer service side of it
Role of the Librarian
Panelist: Barry Abisch
Librarian Role Overview







You are the expert.
What do you need?
What do readers need?
A news Web site has as much in common with a
library as it does with a newspaper.
Become familiar with your newspaper’s Web site.
If it is politically correct, insist that you be consulted
on all matters relating to both archiving and
searching.
If you can't insist, at least offer your services. Odds
are, your online editor will welcome the offer.
Building A Business Case
Panelist: Mark Stencel
Business Case Overview





What’s worth saving
Making money
Indirect revenue
Costs and challenges
Getting credit
Business Case
Does It Pay To Save?
Key points:
  Your news organization can profit from its archive of original online content
  Making money isn’t always profitable (your business case should account for the cost of doing business, not just revenue)
Original Content







Breaking news stories
Standing text (FAQs, online guides and
primers)
Video/Audio
Photo Galleries
E-mail Newsletters
Interactive Discussions/Chats
Databases (listings, scores)
Making Money


Sponsorships (e.g., local visitor guides)
Resale (paid archives; research services,
such as LexisNexis, Factiva; online reprint
rights)
Note: Few good models for selling non-text
content (video, audio, galleries)
Indirect Revenue


Promotion (can archived content attract more
online users or even print or online
subscribers?)
Registration (will users provide valuable e-mail addresses or other personal information
in exchange for access to content?)
Costs and Challenges


Do systems, process, equipment or
personnel cost more than you can make?
Rights Management (which content do you
have legal rights to use, re-use, or re-sell
online)
Content Management (publishing systems
and file/directory management for keeping
track of where your content is)
Costs and Challenges (cont’d.)



Fulfillment and Customer Service (supporting
services you provide to the public or to
partners)
Revenue Shares (accounting for your
partner’s shares)
Coordinating With Parents or Siblings (do
your plans fit in or conflict with the overall
business goals/strategies of your chain?)
Costs and Challenges (cont’d.)


Hosting (server space, streaming)
Un-hosting (time and effort to delete or delink content; automatically deleting content
vs. selectively maintaining content)
Get Credit!

Make sure your department gets credit for
any revenue it generates, not just the bill for
the cost of providing money-making content
and services.
Questions & Answers
Closing remarks
Please complete an evaluation form.
Suggested Resources






“The Archival Black Hole” by Scott Kirsner, Editor & Publisher, 9/19/98
“Archiving the Internet” by Brewster Kahle, Scientific American, 11/4/96
“Nothing But Net, Preserving the Internet, 1 Terabyte at a Time” by Bill Barnes, Slate.msn.com
“It Was Here a Minute Ago!”: Archiving the Net by Susan E. Feldman, Searcher: The Magazine for Database Professionals
“SCC systems archiving billions of bytes at newspapers,” Newspapers & Technology, March 2000
http://www.archive.org