Archiving Web Content
CE Course #285
Sheraton NY Hotel & Towers
Sunday, June 8, 2003

Introductions
  Meet today’s panelists

Today’s Panelists
  Barry Abisch - The Journal News
  Olivia Kobelt - Christian Science Monitor
  Mark Stencel - Washingtonpost.com
  Janine Yagielski - CNN.com

Agenda
  Introductions & session overview
  Technology
  Workflow and processes
  Brainstorming
  Break
  Brainstorming recap
  Role of the librarian
  Building the business case
  Closing comments & session evaluation

Technology
  Panelist: Janine Yagielski

Technology Overview
  What can be archived?
  Preparing content to be archived
  Storing and serving archived content
  Searching archived content

What can be archived?
  Overview of file formats (handout)
  Dynamic and static content
  Archiving presentation as well as content
  Archiving secondary information about online content (traffic information)
  Challenges of changing technologies

Technology: Overview of File Formats (handout)
  Text formats
  Image/graphic file formats
  Video formats
  Other definitions

Technology: Static and Dynamic Content
  Static content: Content that, once posted, does not change.
    Example: Simple story or information page
  Dynamic content: Constantly changing content
    Example 1: Weather data, stock prices
    Example 2: Election results, sports scores (fixed end point)

Technology: Static and Dynamic Content
  Hybrid: Changes occasionally but does not have a predictable updating schedule or end point
    Example: Top story with multiple and significant updates
    Example: Home page or section page

Technology: Archiving Presentation and Content
  CNN.com has built an internal system to archive some presentation
    Home Page, US, World, Politics, International Edition
    One week of pages
    Every 30 minutes
    Perl script
    Size of archive: 55.4 MB

Technology: Archiving Secondary Information about Online Content (traffic)
  CNN.com has an extensive Webstats reporting system that parses and archives the information from Web server logs.
  Simple statistics: Page views, hits (back to 1996)
  Advanced statistics: Unique users, time spent, IP address, OS, browser
  Real Time Monitor: Tracks click-through rates of links
    Home and US pages
    One week of info on links
    Tracks average and peak for links

Technology: Challenges of Changing Technology
  Interdependencies of the Web make it difficult to maintain old content when optimizing for new content.
  Examples: .shtml pages, Vivo video, some Shockwave, other antiquated multimedia technology based on plug-ins

Technology: Preparing Content to be Archived
  Directory structure/database
    Key to consistency and automation in subject-specific archives.
    cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/
  Slug conventions
    Provide an additional method of automating archiving
    Examples: sprj; sprj.nitop; .ap

Technology: Preparing Content to be Archived
  Content Management System
    Imposes and uses directory structure to prepare content for publication, syndication and, in some cases, archiving and searching
  Metadata in stories on publish
    <meta name="DESCRIPTION" content="A U.S. soldier was killed and five were wounded early Thursday in the Iraqi city of Fallujah, the U.S. Central Command announced -- the latest casualties in the city, which has become a center of resistance.">
    <meta name="AUTHOR" content="">
    <meta name="SECTION" content="WORLD">
    <meta name="SUBSECTION" content="meast">
    <meta name="DATE" content="2003-06-05 05:22:20">

Technology: Preparing Content to be Archived
  XML (Extensible Markup Language)
    CNN.com produces an XML file with every story for site search. We also produce XML feeds of story headlines and other data sent to syndication partners.
  Metadata and XML for multimedia
    CNN.com is looking into ways to insert metadata and produce XML feeds of non-traditional stories. Currently, archiving the location and subject of interactive (pop-up) content is only an internal, manual process.
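
The directory structure, slug conventions and story metadata described in the Preparing Content to be Archived slides lend themselves to automated processing. The following is a minimal Python sketch of that idea, not CNN.com's actual tooling. It assumes the path layout year/section/subsection/month/day/slug, inferred only from the single sample URL on the slide, and the names parse_story_path, MetaCollector and extract_meta are hypothetical, introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def parse_story_path(url):
    """Split a story URL into archive fields.

    Assumed layout (inferred from the one sample URL on the slide):
    /YYYY/SECTION/subsection/MM/DD/slug/
    """
    # The slide gives the URL without a scheme, so add one if it is missing.
    if "://" not in url:
        url = "http://" + url
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 6:
        raise ValueError(f"unexpected path layout: {parts}")
    year, section, subsection, month, day, slug = parts
    return {
        "date": f"{year}-{month}-{day}",
        "section": section,
        "subsection": subsection,
        "slug": slug,
        # Dot-separated slug parts (e.g. "sprj", "nitop") can drive
        # subject-specific archiving rules.
        "slug_parts": slug.split("."),
    }


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from a story page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Normalize names to upper case, as they appear on the slide.
            self.meta[name.upper()] = attr.get("content") or ""


def extract_meta(html):
    """Return a dict of meta tag names to content values."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


if __name__ == "__main__":
    print(parse_story_path(
        "cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/"))
    print(extract_meta(
        '<meta name="SECTION" content="WORLD">'
        '<meta name="SUBSECTION" content="meast">'))
```

A script along these lines could sort stories into subject archives by section or slug prefix without opening the story text. Any real site would need to confirm its own path and metadata conventions first.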
Technology: Storing and Serving Archived Content
  Simple storage of content
    Content servers
    Burn to CD
    Web servers (internal and external even if not served)
    Tape backup
  Serving to internal users
    Image query
    Directory browsing on the inside Web servers
    Content purged from the outside is still available (AP, partner stories)
    Limited space on internal Web server (36 GB)

Technology: Serving to All External Users
  All unique URLs published on CNN.com from the launch of the site are still available, unless there was an editorial decision to remove or redirect a URL.
  CNN video is hosted by AOL. Because of changes in hosting and in the capacity of video servers, not all previous video streams are available.

Technology: Serving to All External Users
  Web servers/NFS server
    Hardware: Sun and Intel (running Linux)
    Cost: $10,000-$15,000 (Sun), $5,000 (Intel)
    Capacity: Storage capacity expanded by adding additional hard drives. Serving capacity varies by content.
      HTML: 25K hits/minute; images and style sheets: 60-70K hits/minute
  Video servers
    Hardware: Reconfigured, video-dedicated Web server
    Cost: $1,500-$3,000
    Capacity: Depends on length and size of video and disk space

Technology: Serving to Select Users
  Registration
    E-mail newsletters
    New e-mail alerts
    Backend Oracle database
    JSPs dynamically served
  Subscription video
    Real Networks handles CNN.com’s subscriber authentication

Technology: Searching Archived Content
  Searching for internal users
    Limited functionality for internal materials.
    Graphics image search.
    New publishing tools will incorporate a search of external content.
  Searching for external users
    Site search: Run by AOL. The CMS produces and publishes (restricted by IP) XML files for every story. At set intervals, AOL picks up the XML files and uses them to produce CNN.com’s internal search results.
    Web search: Powered by Google. Sponsored links from Overture. Both sets of results are returned to CNN.com in XML feeds and published on a CNN.com template.
    Video/multimedia search: Exploring Technology

Workflow
  Panelist: Olivia Kobelt

Workflow Overview
  Types of web content - what do we archive?
  Archiving old content
  Internal vs. external archive
  Making corrections/fixes
  Searchability
  Current workflow
  Systems we use
  Future vision

Brainstorming!

Break!
  Be back in 15 minutes!

Brainstorming Recap!
  Legal compliance vs. business user or need
  Copyright - can you archive someone else’s content, partner content?
  Talking to IT about what the requirements are
  How do you approach gathering user requirements? Who are users?
  What are retention criteria? (date, size of files, originals/drafts/versioning, exclude search, business value)
  Hierarchy starting at bottom with knowledge, corporate, business use/reuse, compliance, vital records
  How to capture and keep the hybrid web pages?
  What software applications are available?
  Microfilm archiving?
  What tools are available to automate the archival process?
  Where do we begin?
  Seeking advice in relation to storage, retrieval, technology, etc.
  What type of information/literature is available on the topic of archiving web data?
  Selling the idea to management
  Archiving “how it looked”
  How did we do it? Examples of how a project was done.
  Measure what people are trying to find in older files
  Managing the customer service side of it

Role of the Librarian
  Panelist: Barry Abisch

Librarian Role Overview
  You are the expert.
  What do you need?
  What do readers need?
  A news Web site has as much in common with a library as it does with a newspaper.
  Become familiar with your newspaper’s Web site.
  If it is politically correct, insist that you be consulted on all matters relating to both archiving and searching.
  If you can't insist, at least offer your services. Odds are, your online editor will welcome the offer.

Building A Business Case
  Panelist: Mark Stencel

Business Case Overview
  What’s worth saving
  Making money
  Indirect revenue
  Costs and challenges
  Getting credit

Business Case: Does It Pay To Save?
  Key points:
    Your news organization can profit from its archive of original online content
    Making money isn’t always profitable (your business case should account for the cost of doing business, not just revenue)

Business Case: Original Content
  Breaking news stories
  Standing text (FAQs, online guides and primers)
  Video/audio
  Photo galleries
  E-mail newsletters
  Interactive discussions/chats
  Databases (listings, scores)

Business Case: Making Money
  Sponsorships (e.g., local visitor guides)
  Resale (paid archives; research services, such as LexisNexis, Factiva; online reprint rights)
  Note: Few good models for selling non-text content (video, audio, galleries)

Business Case: Indirect Revenue
  Promotion (can archived content attract more online users or even print or online subscribers?)
  Registration (will users provide valuable e-mail addresses or other personal information in exchange for access to content?)

Business Case: Costs and Challenges
  Do systems, processes, equipment or personnel cost more than you can make?
  Rights management (which content do you have legal rights to use, re-use, or re-sell online?)
  Content management (publishing systems and file/directory management for keeping track of where your content is)

Business Case: Costs and Challenges (cont’d.)
  Fulfillment and customer service (supporting services you provide to the public or to partners)
  Revenue shares (accounting for your partners’ shares)
  Coordinating with parents or siblings (do your plans fit in or conflict with the overall business goals/strategies of your chain?)

Business Case: Costs and Challenges (cont’d.)
  Hosting (server space, streaming)
  Un-hosting (time and effort to delete or delink content; automatically deleting content vs. selectively maintaining content)

Business Case: Get Credit!
  Make sure your department gets credit for any revenue it generates, not just the bill for the cost of providing money-making content and services.

Business Case: Questions & Answers

Closing remarks
  Please complete an evaluation form.

Suggested Resources
  “The Archival Black Hole” by Scott Kirsner, Editor & Publisher, 9/19/98
  “Archiving the Internet” by Brewster Kahle, Scientific American, 11/4/96
  “Nothing But Net, Preserving the Internet, 1 Terabyte at a Time” by Bill Barnes, Slate.msn.com
  “It Was Here a Minute Ago!: Archiving the Net” by Susan E. Feldman, Searcher: The Magazine for Database Professionals
  “SCC systems archiving billions of bytes at newspapers,” Newspapers & Technology, March 2000
  http://www.archive.org