Document Computing
Technologies for Managing Electronic Document Collections
Ross Wilkinson ... [et al.]
Circulation Counter [RES3H] ZA4080 .D63 1998

Chapter 1 Document Lifecycle

What is a document?
A document records a message from people to people.

Characteristics of a document
• Content
• Structure
• Metadata

Metadata
• A message has a context, which is important for understanding the message.
• A document contains not only the contents of a message, but also some information about the document, e.g. author, date, recipients.
• We call such information the metadata about the document.

Why Document Management?
• It is hard to find documents.
• It is hard to organize documents.
• It is hard to control documents.
• Metadata helps document management.

Benefits of Document Management
• Location-independent delivery of documents upon demand
• Controlled access to documents
• A record of the life of a document
• Better re-use of documents

Chapter 2 Electronic Document Description

Document Content
• Simplest type of content – unformatted text
• Text retrieval system based on search by keywords
• E.g. Windows Desktop Search
• Optical character recognition (OCR) system

Document Structure
• Even unformatted text has some structure, e.g. lines, words, images, etc.
• A document may have elaborate structures.
• Two levels of structure:
– Logical structure
– Presentational structure

Logical structures
• Example:
TO: John D.
FROM: Kate M.
DATE: 7/8/98

I have finished Stage B of the design. Could you take a look at it?
• Simple logical structure: lines of text
• A logical structure of a memo: (see next slide)

A logical structure for a memo
Memo
  Head
    Sender
    Receiver
    Date
  Body
    Paragraph

Presentational Structure
• A different presentational structure for the same memo:
John D., 7/8/98
I have finished Stage B of the design. Could you take a look at it?
Kate M.
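The split between logical and presentational structure can be sketched in code. The following is only an illustration, assuming Python; the `Memo` class and the two `render_*` functions are hypothetical names introduced here and do not come from any document management system. One logical structure is held once and rendered in the two layouts shown for the memo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Memo:
    """Logical structure of the memo: a head (sender, receiver, date) and a body."""
    sender: str
    receiver: str
    date: str
    paragraphs: List[str]


def render_headed(memo: Memo) -> str:
    """First presentation: TO/FROM/DATE lines, a blank line, then the body."""
    head = [f"TO: {memo.receiver}", f"FROM: {memo.sender}", f"DATE: {memo.date}"]
    return "\n".join(head + [""] + memo.paragraphs)


def render_letter(memo: Memo) -> str:
    """Second presentation: receiver and date first, sender as a signature."""
    return "\n".join([f"{memo.receiver}, {memo.date}"] + memo.paragraphs + [memo.sender])


memo = Memo(
    sender="Kate M.",
    receiver="John D.",
    date="7/8/98",
    paragraphs=["I have finished Stage B of the design. Could you take a look at it?"],
)

print(render_headed(memo))
print()
print(render_letter(memo))
```

Because the content lives only in the `Memo` object, adding a third presentation (for example a Web page) would mean adding another render function, not editing the memo itself.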
Presentation medium
• The content of the same document can be presented in different media with different presentational structures.
• E.g. a PDF file vs. an online Web page

Metadata
• Generally, we need metadata to capture:
– Registration information
– Usage information
– Structural properties
– Contextual information
– Content description
– Historical information

The Dublin Core metadata set
• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format: e.g. HTML, PDF
• Identifier: e.g. URI
• Source
• Language
• Relation
• Coverage: duration
• Rights: e.g. copyright

Document Description Language (DDL)
• For use by document management systems
• E.g. RTF, PostScript, SGML
• DDL support:
– Language support, media support, transparency, structure, link support, metadata support
• Other DDL characteristics:
– Document creation, import conversion, export transformation, update, presentation quality, presentation flexibility, etc.

Examples of DDLs
• ASCII (American Standard Code for Information Interchange)
• Unicode
• ASCII and Unicode offer very limited support for the DDL features listed above
• Rich Text Format
• TeX and LaTeX
• SGML, HTML, XML
• PostScript, PDF

Rich Text Format (RTF)
• Developed by Microsoft
• For interchange between Microsoft Word and other software
• Main purposes:
– Preserve information in Word (blocks of text)
• Example: next slide

{\rtf1\adeflang1025\ansi\ansicpg1252\uc2\adeff0\deff0\stshfdbch13\stshfloch0\stshfhich0\stshfbi0\deflang2057\deflangfe1028{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman …
{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}{\creatim\yr2008\mo3\dy18\hr15\min24}{\revtim\yr2008\mo3\dy18\hr15\min25}{\version1}{\edmins1}{\nofpages1}{\nofwords14}{\nofchars81}{\*\company Lingnan University}{\nofcharsws94} …
\ltrch\fcs0 \insrsid1782868\charrsid1782868 \hich\af0\dbch\af13\loch\f0 John D., 7/8/98
\par \hich\af0\dbch\af13\loch\f0 I have finished Stage B of the design. Could you take a look at it?
\par
\par \hich\af0\dbch\af13\loch\f0 Kate M\hich\af0\dbch\af13\loch\f0 .
\par }\pard \ltrpar\ql \li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid4811147
\par }}

TeX and LaTeX
• TeX was created by Donald Knuth.
• TeX is a typesetting system.
• LaTeX was built on top of TeX by Leslie Lamport.
• LaTeX uses markup constructs to separate logical description from presentation.
• LaTeX example: see next slide

\documentclass{article}
\usepackage{times}
\pagestyle{empty}
\begin{document}
\title{Sample Document}
\author{ W. L. Yeung\\Department of Computing and Decision Sciences\\ Lingnan University, Hong Kong\\wlyeung@ln.edu.hk}
\maketitle
\section{Introduction}
…
\section{Conclusion}
…
\end{document}

SGML
• Standard Generalized Markup Language
• To describe a document in SGML, we need:
– An SGML declaration
– A document type definition (DTD)
– A document instance
• An SGML declaration specifies which characters and delimiters are used in the DTD and the document instance. Normally a default is used.

SGML (cont.)
• A document type definition (DTD) defines the rules for forming a class of documents, i.e. the grammar of a document class.
• The building blocks of SGML documents are elements.
• A DTD for the memo document: next slide.

<!-- DTD for office memo -->
<!-- ELEMENT CONTENT -->
<!ELEMENT memo - - (head, body, close?) >
<!ELEMENT head O O (to & from & date) >
<!ELEMENT to - - (#PCDATA) >
<!ELEMENT from - - (#PCDATA) >
<!ELEMENT date - - (#PCDATA) >
<!ELEMENT body - - (#PCDATA) >
<!ELEMENT par - - (#PCDATA) >
<!ELEMENT close - - (#PCDATA) >
<!-- ELEMENT NAME VALUE DEFAULT -->
<!ATTLIST memo status (con|pub) pub >
<!ATTLIST par id ID #IMPLIED >

DTD
• An element definition gives the name of the element, then the rules for building that element.
• Elements can contain other elements.
• Terminal (basic) elements often consist of parsed character data (#PCDATA) or unparsed character data (CDATA).
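To make the content-model rules concrete, here is a minimal sketch, assuming Python. The standard library has no SGML parser and does not validate against a DTD, so this example makes two labelled assumptions: the memo is written as XML with explicit `<head>` tags, and the two rules `(head, body, close?)` and `(to & from & date)` are checked by hand. The function name `check_memo` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the memo rewritten as XML, with explicit <head> tags
# (XML has no tag omission, unlike the SGML DTD).
MEMO = """
<memo>
  <head><to>John D</to><from>Kate M</from><date>7/8/1998</date></head>
  <body>I have finished Stage B of the design.</body>
</memo>
"""


def check_memo(root: ET.Element) -> list:
    """Return a list of content-model violations (empty if none were found)."""
    if root.tag != "memo":
        return [f"root element is <{root.tag}>, expected <memo>"]
    errors = []

    # memo: (head, body, close?) -- a fixed sequence with an optional last item
    children = [child.tag for child in root]
    if children not in (["head", "body"], ["head", "body", "close"]):
        errors.append(f"memo children {children} do not match (head, body, close?)")

    # head: (to & from & date) -- each exactly once, in any order
    head = root.find("head")
    if head is not None:
        parts = sorted(child.tag for child in head)
        if parts != ["date", "from", "to"]:
            errors.append(f"head children {parts} do not match (to & from & date)")
    return errors


print(check_memo(ET.fromstring(MEMO.strip())))
print(check_memo(ET.fromstring("<memo><body>text</body></memo>")))
```

A real validating parser derives these checks from the DTD itself; the hand-written version only shows what "the grammar of a document class" means in practice.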
The memo in SGML
<MEMO>
<TO> John D </TO>
<FROM> Kate M </FROM>
<DATE> 7/8/1998 </DATE>
<BODY>
<PAR> I have finished Stage B of the design. </PAR>
</BODY>
</MEMO>

HTML
• Hypertext Markup Language
• For World Wide Web (WWW) documents
• Conforms to an SGML DTD
• HTML is presentation oriented: instructions (tags) are inserted into a document for presentation effects
• The DTD for HTML is available on http://www.w3.org/TR/html401/sgml/dtd.html

The memo in HTML
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY>
<H1>Memo</H1>
<P>I have finished Stage B of the <A HREF="/team3/design2">design</A>. </P>
</BODY>
</HTML>

XML
• Extensible Markup Language
• Three basic definitions:
– XML for representing data and documents
– XLink and XPointer for representing inter-document linking
– XSL for representing presentation
• XML is a near-subset of SGML

XML (Cont.)
• Two classes of XML documents:
– Valid XML documents: documents that conform to a specific supplied DTD
– Well-formed documents: only satisfy a simple default grammar, without conforming to a specific DTD
• XML is widely used in electronic commerce because it lets businesses exchange electronic documents in standard XML-based formats.

PostScript
• Developed by Adobe
• For representing documents that are to be printed (mainly on laser printers)
• A page description language optimized for printing text, images and graphics

Portable Document Format (PDF)
• Developed by Adobe
• A page description language for representing text, graphics and images
• A PDF file contains presentation information on pages, annotations, links, fonts, etc.
• Supports delivery of electronic documents exactly as they would appear in printed form.
• Not designed for editing or document format exchange.
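Returning to the two classes of XML documents described in the XML slides, the weaker one (well-formedness) can be checked with a short sketch, assuming Python. The standard-library `xml.etree.ElementTree` parser is non-validating: it reports whether text is well-formed but never consults a DTD, so checking validity would need a separate validating parser. The helper name `is_well_formed` and the two sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<memo><to>John D</to><from>Kate M</from><body><par>Done.</par></body></memo>"
bad = "<memo><to>John D<from>Kate M</from></memo>"  # <to> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can pass this check and still be invalid: `good` above would be rejected by a memo DTD that requires a date element.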